**Texts in Quantitative Political Analysis** *Series Editor:* Justin Esarey

# Alessia Damonte Fedra Negri *Editors*

# Causality in Policy Studies A Pluralist Toolbox

# **Texts in Quantitative Political Analysis**

#### **Series Editor**

Justin Esarey, Dept of Politics, FM Kirby Hall 319 Wake Forest University Winston Salem, NC, USA

This series covers the novel application of quantitative and mathematical methods to substantive problems in political science as well as the further extension, development, and adaptation of these methods to make them more useful for applied political science researchers. Books in this series make original contributions to political methodology and substantive political science, while serving as educational resources for independent practitioners and analysts working in the field.

This series fills the needs of faculty, students, and independent practitioners as they develop and apply new quantitative research techniques or teach them to others. Books in this series are designed to be practical and easy-to-follow. Ideally, an independent reader should be able to replicate the authors' analysis and follow any in-text examples without outside help. Some of the books will focus largely on instructing readers how to use software such as R or Stata. For textbooks, example data and (if appropriate) software code will be supplied by the authors for readers.

This series welcomes proposals for monographs, edited volumes, textbooks, and professional titles.


*Editors*

Alessia Damonte, Social and Political Sciences, University of Milan, Milano, Italy

Fedra Negri, University of Milan, Milano, Italy; University of Milan-Bicocca, Milano, Italy

This book is an open access publication.

ISSN 2730-9614 ISSN 2730-9622 (electronic) Texts in Quantitative Political Analysis ISBN 978-3-031-12981-0 ISBN 978-3-031-12982-7 (eBook) https://doi.org/10.1007/978-3-031-12982-7

© The Editor(s) (if applicable) and The Author(s) 2023

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

## **Preface**

How can we think of causation in policy research? Almost any research tradition provides a different answer. For instance, emphasis can be placed either on the process leading to a policy outcome or on its underlying conditions. A process can be either observable or unobservable, and the underlying relevant conditions can be understood as single factors or complex configurations. Samples, populations, or single cases can be invoked as the proper empirical ground for grasping them. Evidence can be arranged to claim either relevance or irrelevance. These differences reflect as many distinct assumptions about the shape of causation and build as many research strategies.

*Causality in Policy Studies* equips researchers to meet two related challenges in the field. First, algorithms for data analysis embed selected assumptions about causation that often remain unspoken. Knowing these assumptions is crucial to understanding how algorithms can be appropriately employed and eventually combined to compensate for their blind spots and weaknesses. Second, policy research is carried out within various disciplines (such as political science, sociology, economics, management, and administration), each often married to particular traditions. The book addresses the technical drive of such differentiation. In doing so, it provides the opportunity for researchers of any stripe to familiarize themselves with the strategies on which other streams build their claims.

In short, the book shows how to learn from different causal techniques, apply them consciously, and possibly make them speak to each other to get a better sense of findings. For this purpose, it structures the journey into causal knowledge in three stages. First, it introduces the foundational issues of causation (Chaps. 1 and 2). Then, it exposes the inner workings of selected techniques for causal analysis (Chaps. 3, 4, 5, 6, 7, 8 and 9). Last, it considers some incompatibilities and complementarities among techniques to improve causal knowledge (Chaps. 10 and 11).

The red thread connecting all chapters is a reasonable realist stance. All share the tenets that causation is factual and entails generative and transfer processes unfolding at different levels of reality. Moreover, the chapters agree that causation can be known. Hypothetical statements about its manifestations, direction, and conditions can be given a testable shape. They also agree that causal statements should be believed when logically and empirically compelling. The book's commitment to methodological pluralism follows from these tenets. The complexity of causal phenomena is such that no single technique can grasp its entirety. Still, each technique can illuminate particular facets in response to a precise research question. Indeed, asking whether a factor can yield one outcome differs from asking how it happens or under which conditions it obtains, and each response calls for adequate analytic tools. When pieced together, these responses can offer a better account of the phenomena of interest.

Methodological pluralism can deliver on the promise of better knowledge if the strengths and weaknesses of each technique are understood and tackled. To this end, each substantive chapter clarifies the research question a technique can answer, the research design and data treatment the technique requires for credible results, and the domain of validity of its findings. Wherever possible, a replicable example illustrates the deployment of the analysis as the sequence of operations and actual decisions. Of course, this selection of techniques is far from exhaustive of the methodological variety of policy studies. Nevertheless, this suite provides sharp insight into the different strategies to establish the tenability of a causal statement. As such, it can offer guidance beyond the boundaries of this book.

The edited format of the book aims at providing highly usable and solid knowledge for policy assessment and evaluation to MA students, PhD students, scholars, and practitioners in policy-related fields. Thus, the chapters are authored by recognized scholars from different backgrounds, generations, and perspectives. Such a diverse yet "close-knit" team is essential to the volume. A single author could hardly have covered such a range of techniques with comparable expertise.

Public policies are tools and governance systems to tackle collective problems. Good policies call for a generation of open-minded scholars and practitioners willing to understand and learn from research conducted in different fields and capable of handling the techniques in their toolbox consciously and carefully. We hope you will have a good time going through the chapters. Enjoy your journey!

Milano, Italy

Alessia Damonte

Fedra Negri

## **Acknowledgments**

Every book is a collective enterprise, and some more than others. Our first thanks go to the many students who have compelled this project and shaped it during our courses. They have been the real drivers of this effort, and we are more than grateful for how they kept our motivation high over the many months of writing and revising. Heartfelt thanks also go to Licia Papavero, Francesco Zucchini, and the board of the Ph.D. in Political Studies of the University of Milan. Their constant support to the Summer School in "Research Strategies in Policy Studies" (ReSPoS) has proven vital to the maturation of this project, which consolidates an experience dating back to 2013, now a seemingly distant past. The School and the project, in turn, would not have been possible without the financial contributions of the Compagnia di San Paolo (Turin, Italy) through the Network for the Advancement of Social and Political Sciences (NASP) directed by Maurizio Ferrera. To them goes our sincere gratitude. We are also greatly indebted to Springer's Senior Editor for Economics, Political Science, and Public Administration, Lorraine Klimowich, and the Editor of the 'Textbook on Political Analysis' series, Justin Esarey. Their precious suggestions and faultless encouragement have been fundamental to finalizing a project intentionally positioned at the crossroads of many disciplines and research standards. Our further debt of gratitude is owed to Luigi Curini and the Standing Groups "MetRiSP: Research Methods for Political Science" and "Political Science & Public Policy" of the SISP, the Italian Society of Political Science. We have treasured their feedback on earlier versions and their backing of the main idea. Finally, we gratefully acknowledge the financial support of Protego, an advanced project funded by the European Research Council (Grant agreement n°694632).

University of Milan, Milan, Italy

Alessia Damonte

Fedra Negri



# **Chapter 1 Introduction: The Elephant of Causation and the Blind Sages**

**Alessia Damonte and Fedra Negri**

*It was six men of Indostan, To learning much inclined, Who went to see the Elephant (Though all of them were blind), That each by observation Might satisfy his mind. John G. Saxe (1816–1887).*

**Abstract** What does a policy outcome hinge on? The response is vital to policymaking and calls for the best of our knowledge from a variety of disciplines—from economics to sociology and from political science to public administration and management. The response entails a stance about causation, however, and almost every discipline has its own. Researchers are like the blind sages who had never come across the elephant of causation before and who develop their idea of the elephant by "touching" a different part of it. Which part of the elephant will you happen to touch? Will you be able to listen to and understand what the other sages will tell you?

#### **1.1 Policy Decisions and Causal Theories**

Common wisdom understands public policies as governments' decisions to tackle a collective problem. These decisions deploy rules, information, taxes, and expenditures to get "people to do things that they might not otherwise do" or "do things that they might not have done otherwise" (Schneider & Ingram, 1990: 513). By inducing a change in people's willingness and capacity to "do things," policy-makers expect the problem to disappear or, at least, take a more bearable shape.

A. Damonte (\*)

University of Milan, Milan, Italy e-mail: alessia.damonte@unimi.it

F. Negri University of Milan - Bicocca, Milan, Italy e-mail: fedra.negri@unimib.it

Thus, the kernel of policy decisions is the causal theory that they encapsulate: first, of the behavior at the root of the collective problem; second and relatedly, of the capacity that certain tools have to make such behavior change for the better. The theory connects outcomes to behavior and then identifies the "carrots, sticks, and sermons" (Vedung, 2010) best suited to put or keep such behavior on a desirable track. For example, in their fight against cancer, governments can address smoking as a proven causal factor and assume people smoke if they have the wrong information or are shortsighted about the consequences of their behavior; otherwise, they would reasonably quit. Governments can fund education campaigns to convey the right information, require tobacco products to carry warning labels, or disallow tobacco advertising and sponsorship. Moreover, to compensate for people's shortsightedness, they can levy "sin taxes" upon tobacco products to make prices a better signal of the hidden costs of smoking or enforce smoke bans that protect non-smokers. Whether a government applies none, one, or a mix of these tools, in turn, depends on policy-makers; whether their decisions reach the addressees properly, instead, is an administrative and a governance matter (e.g., McConnell, 2010). Regardless of the point of attack, the issue of policy success and failure inevitably appeals to causal theories on endowments, concerns, constraints, and incentives accounting for behavior (e.g., Ostrom, 2005).

Policy studies offer exemplary illustrations of the twofold stake of causal theories. First, these theories allow us to make sense of the world. Our bewilderment at some diversity in performance dissolves when we are offered satisfying accounts of relevant behaviors. Second, these theories have straightforward practical implications for individual and collective strategies. If we know which factors compel an event or suppress it, we can change the event's odds by controlling these factors. Then, the driving question remains: how can we get to know these factors well enough to build decisions on them?

#### **1.2 The Elephant of Causation**

Across the philosophy of science and social sciences, the responses to this question invite analogies with the blind sages in Saxe's poem (1872), who "prate about an Elephant that / Not one of them has seen."<sup>1</sup> Indeed, actual causation is the complex local production of an outcome, and it is hard to identify before it unfolds. The usable knowledge of a causal process pinpoints the key factors of its unfolding that allow us to see it coming in the next instance and, eventually, change its odds (e.g., Craver and Kaplan, 2020). Such knowledge requires criteria to identify the key causal factors beyond the single case and credibly so. Historically, guidelines for identifying the key causal factors developed along two lines.

<sup>1</sup>The poem tells the story of a group of blind sages who have never come across an elephant before and who learn what the elephant is like by touching it. Each blind sage feels a different part of the elephant's body, but only one part. They then describe the elephant based on their limited experience and "Though each was partly in the right, /And all were in the wrong!" (Saxe, 1872).

#### *1.2.1 Elephants by the Principle*

The most enduring guideline for determining the key causal factors before a process unfolds has come from the Aristotelian philosophy of science. There, causation was traced back to four kinds of principles, known as "material," "formal," "efficient," and "final." The first two principles capture the structural features of a causal process, namely, its constituent elements and the shape of their arrangement. The latter two refer to agency and locate the key factors in outer stimuli or the drive from inner purposes (e.g., Moravcsik, 1974). The original "doctrine" maintained that adequate responses to any why-question appealed to all four principles together.

Indeed, convincing accounts still locate actual causation in the interplay of structure and agency, as influential mechanistic perspectives make clear (e.g., Little, 2011; Craver, 2006). More often, current research streams specialize in single principles. For example, the causal role of "material" ascriptive features is a driving concern of gender and minority studies. The generative power of formal arrangements is the core tenet of, for instance, game theories. Studies on expected utility, values, habits, and emotions take heed of the final goals and motivations, providing fundamental assumptions for neo-institutionalist and behavioral approaches of various stripes. Efficient factors are any stimulus, intervention, or treatment that can elicit a response; thus, they are central to theories of policy instruments, regimes, or political communication, among many others.

With some exceptions (e.g., Bache et al., 2012; Kurki, 2006), current theories seldom claim an explicit legacy with the original canon. The doctrine has fallen into disrepute as improperly scientific, because it invoked a metaphysical reason to justify the causal standing of its four principles. The tenet that individuals with similar features, in a similar situation, with similar motivations, under equivalent stimuli did and will behave in similar ways was justified by the belief that all embodied the same metaphysical essence. As Aristotle argued in a seminal fragment, planets do not twinkle because planets are near things, and not twinkling was intrinsic to near things. Thus, the next planet will not twinkle, too, in force of its "near-thingness."

This line of reasoning easily lends itself to circular arguments that restate general assumptions instead of probing them. As late as 1673, Molière still had reasons to satirize it. In his comedy *The Hypochondriac*, a "docto doctore" explains in dog Latin that opium makes people sleepy because it embodies a "dormitive virtue." However, the ultimate criticism came from the British Empiricists, who saw in the appeal to essences a mode for preserving beliefs against evidence and a fundamental obstacle to progress and learning.

#### *1.2.2 Elephants by the Rules*

The rejection of metaphysical warrants has called for a different ground for causal inference. Whether a reliable connection exists between being a near thing and not twinkling across cases, so the argument goes, can only be decided empirically.

Yet, causal evidence does not come to us with labels and numbers attached. Assumptions are still needed about the empirical traces that distinguish between relevant and irrelevant causal factors. In Hume's much-quoted words, causally relevant is:

an object followed by another and where all the objects, similar to the first, are followed by objects similar to the second. Or, in other words, where, if the first object had not been, the second never had existed. (Hume, 1748, Section VII, Part II, §60).

In short, a factor is relevant to an outcome in the single case under two warrants: the association of the two conforms to a *regular* pattern, and it supports *counterfactual* reasoning.

#### **1.2.2.1 Regularity**

The regularity warrant ("where all the objects, similar to the first, are followed by objects similar to the second") renders the empirical footprint of Aristotelian essences without assuming them and builds on the repeated observation of similar occurrences.

All objects sharing the same feature are similar and constitute a distinct class. Regularity, then, is established between objects in different classes, for instance, in the class of "swan" and in the class of "white." It requires that any observation of the first class entails one in the second. When the regularity holds, causal knowledge can be circulated through handy formulae such as "if a swan, then white."

To apply to the next instance, these formulae have to prove faultless, which is hardly the case: classes and gauges are human constructs and can prove too strict or liberal to capture actual causation in the next instance. Hence, regularity holds provisionally only until we meet the black swan that forces a revision of the scope of our regularity tenets.

Regularity may also seem perfect just because we measured two consequences of the same process. These relationships are useful for prediction; however, they do not qualify as causal as they do not grant control over the events' odds as desired in public policy. Indeed, a barometric reading can be relied upon to prepare for extreme weather conditions but does not license the belief that the coming storm can be tamed by forcing the barometer's pointer. Thus, regularity can be a necessary trait of usable knowledge but is insufficient to declare the causal standing of a relationship.
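The barometer example can be made concrete with a small simulation. The sketch below is illustrative only: the probabilities, the function name `storm_rate`, and the `forced_reading` argument are assumptions introduced here and are not part of the original argument. Atmospheric pressure acts as the common cause of both the reading and the storm, so the reading predicts the storm well; forcing the pointer, however, leaves the share of stormy days unchanged.

```python
import random


def storm_rate(n=10_000, forced_reading=None, seed=1):
    """Simulate n days; return {reading: share of stormy days with that reading}."""
    rng = random.Random(seed)
    counts = {"low": [0, 0], "high": [0, 0]}  # reading -> [storms, days]
    for _ in range(n):
        low_pressure = rng.random() < 0.3            # common cause (assumed rate)
        storm = low_pressure and rng.random() < 0.9  # storms depend on pressure only
        # Left alone, the barometer tracks pressure; otherwise the pointer is forced.
        reading = forced_reading or ("low" if low_pressure else "high")
        counts[reading][0] += storm
        counts[reading][1] += 1
    return {r: s / d for r, (s, d) in counts.items() if d > 0}


observed = storm_rate()                          # regularity: reading predicts storm
forced_low = storm_rate(forced_reading="low")    # manipulation of the pointer
forced_high = storm_rate(forced_reading="high")

print("observed:", observed)
print("pointer forced low:", forced_low)
print("pointer forced high:", forced_high)
```

In the observed data, storms cluster on days with a low reading. Once the pointer is forced, the storm rate is the same whichever way it points, which is the sense in which the regularity supports prediction but not control.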

#### **1.2.2.2 Counterfactual**

The counterfactual ("where, if the first object had not been, the second never had existed") enters the picture as the additional warrant to establish causal relevance and ideally applies to the factor in the single case independent of regularity. The warrant borrows from the classical rules of argumentation and the indirect proofs in geometric demonstrations; however, it displays an empirical edge. Counterfactuals link causal relevance to evidence that we could compel a change in the second object by manipulating the first.

From the Humean definition, manipulation is usually understood as suppression; more generally, it means switching the observed state of a feature into its opposite. Thus, counterfactual reasoning requires, first, that we imagine the first object with the switched feature and, then, that we can only draw impossible or contradictory conclusions from it (e.g., Levi, 2007). An exemplary illustration comes directly from Hume. Despite his deep skepticism toward the human mind's ability to fully understand causation, he conceded that our intuitions must be somehow right. To justify his claim, he reasoned that had our mind always got causation wrong (switching the feature), then humankind would have long gone extinct (drawing a conclusion), which contrasts with us thriving as a species (showing the conclusion absurd). Such a counterfactual criterion improves on the regularity test, as regular non-causal features fail it: since a broken barometer cannot stop a storm, the barometer cannot be recognized as having any causal standing.

However, counterfactuals have their limits, too. First, they cannot be established unless all the plausible alternative causes of the same outcome are ruled out. Hume's argument does not exclude that humankind's evolutionary success instead depends on, for instance, sheer luck—and the unaccounted alternative undermines the cogency of its conclusion. The second and related issue is serious to the point of earning the title of "fundamental problem of causal inference" in some quarters (e.g., Holland, 1988). Unless we cast the same causal process in the same unit with and without the feature of interest, we cannot establish whether switching the feature can change the outcome.
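The second issue is often stated in potential-outcomes notation. The symbols below are introduced here for illustration and do not appear in the original text: let $D_i \in \{0, 1\}$ indicate whether unit $i$ displays the feature of interest, and let $Y_i(1)$ and $Y_i(0)$ denote the outcomes the unit would show with and without it. The effect in the single case and the outcome actually observed can then be written as:

$$
\tau_i = Y_i(1) - Y_i(0), \qquad Y_i^{\mathrm{obs}} = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0).
$$

Since only one of the two potential outcomes is ever observed for a given unit, $\tau_i$ cannot be computed from data on that unit alone.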

#### **1.3 The Blind Sages' Portrayals as the Book's Blueprint**

The criteria to establish causation by regularity and counterfactual evidence seem as straightforward as impossible to meet. Nevertheless, techniques have been developed as strategies to circumvent the Humean paradoxes and provide empirical warrants to the claim of causal relevance. As Little shows in Chap. 2, technical specialization has undermined the dialogue among techniques and their findings. The appeal to regularity, counterfactual, or mechanistic principles has turned into as many ultimate understandings of causation: "laws" and counterfactuals offered a rival ground for experimental practices; mechanisms distanced themselves from both and licensed causal analysis in actual cases only, under consideration that any conclusion about aggregates necessarily entails an unfaithful reduction; in the end, all models are wrong.

However, the possibility of integration remains when techniques commit to three considerations and are consistent with a reasonable scientific realism. First, causation is real, but our best knowledge of it remains a useful approximation. Second, regularity and counterfactuals are epistemic criteria to establish whether portrayals qualify as valid causal accounts; mechanisms are ontological assumptions about single actual elephants instead. Third, the difference between mechanistic description, models, and laws is not of kind but degree: when they address a common slice of the world, they provide a map of it with different details, abstraction, and scope. Under these commitments, techniques can be understood as devices to respond to special questions about the elephant.

#### *1.3.1 Can this Single Factor Make Any Difference?*

The family of experimental and quasi-experimental techniques offers the most renowned, successful, and contentious example at once due to the diffusion of randomized controlled trials as the "gold standard" of scientific knowledge production (e.g., Kabeer, 2020; Deaton & Cartwright, 2018; Dawid, 2000). This family shares the consideration that although we cannot observe a counterfactual directly, we can construe credible "twin worlds" and "treat" one so that the feature of interest provides the only difference to which the difference in responses can be ascribed.

As Battistin and Bertoni show in Chap. 3, this strategy keeps the role of causal assumptions to the minimum required by a stimulus-response model: the treatment is a supposedly efficient cause and connected to performance by a function of a specific shape (often linear) without further details. Unsurprisingly, these techniques are a cornerstone of usable public policy knowledge: they can establish the capacity of a change in taxation, expenditure, information, and regulation to elicit some effect of interest, apparently without the need for further knowledge.

The credibility of this strategy's conclusions, however, rests heavily on the research design: findings are sound if the twin worlds are construed as statistically identical and independent aggregates, the treatment is forced evenly onto all the units of one world only, and the difference in responses is not affected by the treating procedure or unrelated endogenous dynamics. The threats arise as the statistical aggregates with identical parameters can hide a remarkable inner heterogeneity that may bias both groups' responses in unknown directions. As elaborated by Negri in Chap. 4 and Ornstein in Chap. 5, within the family, this heterogeneity is addressed as the result of selection biases that can be reduced by accounting for observed imbalances and crafting "populations of twins." The solution, however, leaves open the issue of the bending effects of unobservable factors.

The (quasi-)experimental family, in short, can provide reliable measures of the net effect of a treatment, but necessarily at the cost of disregarding the reasons for the diversity in the responses of the treated.
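A minimal simulation can illustrate this trade-off. It is a sketch under stated assumptions rather than the procedure of any chapter: the function name `run_trial`, the hidden "responsive" type, and all numerical values are hypothetical. Half of the simulated units respond to the treatment with an effect of 4 and half do not respond at all; random assignment recovers the average effect of about 2 while remaining silent on which units respond.

```python
import random
from statistics import mean


def run_trial(n=20_000, seed=7):
    """Randomly assign treatment and return the difference in group means."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        responsive = rng.random() < 0.5      # hidden heterogeneity (assumed)
        baseline = rng.gauss(10.0, 1.0)      # outcome without treatment
        effect = 4.0 if responsive else 0.0  # individual effect is 4 or 0
        if rng.random() < 0.5:               # coin-flip assignment
            treated.append(baseline + effect)
        else:
            control.append(baseline)
    return mean(treated) - mean(control)


estimate = run_trial()
print(f"estimated net effect: {estimate:.2f}")  # close to the true average of 2
```

The difference in means approximates the average of the individual effects, yet no unit in this simulated population actually has an effect of 2.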

#### *1.3.2 Through Which Structures?*

The diversity in responses is instead the driving concern of the second group of techniques. They address it by flipping the experimental balance of model and design and committing themselves to additional assumptions. They conceive of the generative process as patterns of dependence and assign causal relevance to the bundle of factors that fit them.

The reliance on models sidelines the issue of unit selection as, ideally, any unit carries usable information about the tenability of the causal structure of interest. The structure, moreover, provides the fixed points that still make counterfactuals observable. However, models require criteria to select meaningful variables, and structural assumptions provide only partial guidance for this. The main decisions can only be made in light of substantive theories about the generation of the outcome and, hence, of some previous local knowledge. Within this framework, each technique relies on different languages and pursues different goals.

Path analysis develops within a Bayesian mindset and understands causation as ordered dependencies fitting a few known shapes: chains, colliders, and forks. As Röth clarifies in Chap. 6, these shapes explain because they elaborate on the connection between an alleged causal condition and the dependent variable by displaying the intermediate causal link, the common factor, or the equivalent alternative factors that support the hypothesis about the unfolding of the causal process before the outcome. The technique supports a neater identification of the mechanism linking a factor of interest and its outcome, affords counterfactual analysis, and provides specific suggestions about the "scope conditions" ensuring the mechanisms. Röth contends that these features qualify path analysis as the natural companion of experimental studies for its capacity to establish the contextual requirements that enhance and refine the validity of their findings.

Qualitative comparative analysis (QCA) instead builds on sets and Boolean algebra and understands causal structures as teams of individually necessary and jointly sufficient factors to an outcome. In Chap. 7, Damonte makes three points about the explanatory import of the technique. First, its assumptions about the shape of causation support complex causal theories about the interactions of triggering, enabling, or shielding conditions of some underlying causal process. Second, its parameters of fit allow diagnosing the underspecification of the theory to the cases at hand, while the algorithm provides a pruning counterfactual device that takes care of its overspecification. Last, sets remap qualities onto quantities, which warrant meaningful and sound solutions. Thus, QCA can formalize and test theories about the teams of conditions beneath policy success and failure across given cases beyond special processes. As such, the technique especially suits the purpose of systematic *ex-post* evaluation of policy designs.
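The idea of individually necessary and jointly sufficient factors can be sketched on a toy crisp-set table. The five cases, the condition labels `A` and `B`, and both function names are hypothetical and introduced here for illustration; the ratios are the usual set-theoretic consistency scores, and the sketch does not reproduce the minimization algorithm discussed in Chap. 7.

```python
# Hypothetical crisp-set memberships: 1 = member of the set, 0 = non-member.
cases = [
    {"A": 1, "B": 1, "Y": 1},
    {"A": 1, "B": 1, "Y": 1},
    {"A": 1, "B": 0, "Y": 0},
    {"A": 0, "B": 1, "Y": 0},
    {"A": 0, "B": 0, "Y": 0},
]


def sufficiency_consistency(cases, conditions, outcome="Y"):
    """Share of cases displaying all the conditions that also display the outcome."""
    members = [c for c in cases if all(c[k] == 1 for k in conditions)]
    if not members:
        return None  # the configuration is not observed
    return sum(c[outcome] for c in members) / len(members)


def necessity_consistency(cases, condition, outcome="Y"):
    """Share of cases displaying the outcome that also display the condition."""
    positives = [c for c in cases if c[outcome] == 1]
    if not positives:
        return None  # the outcome is not observed
    return sum(c[condition] for c in positives) / len(positives)


print(sufficiency_consistency(cases, ["A", "B"]))  # the team A*B
print(sufficiency_consistency(cases, ["A"]))       # A on its own
print(necessity_consistency(cases, "A"))
print(necessity_consistency(cases, "B"))
```

In this toy table, each condition is present whenever the outcome occurs, but only the two together are always followed by the outcome; either one alone is not.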

#### *1.3.3 Through Which Process?*

The knowledge of the dynamics of a causal situation is the missing piece of knowledge and the core concern of two further strategies, aiming to open up the black box of causation. Both share the direct interest in the actors and their interplay as the ultimate ground of causation, although their point of attack within the causal stream of actions is different.

Bayesian process tracing addresses causation within its local context. In Chap. 8, Bennett shows how analysts can rely on this technique to make causal sense of the chain of events to policy success or failure retrospectively. The strategy understands hypotheses as plausible Bayesian beliefs that we can entertain about the causal process and that evidence can confirm or disconfirm. The weight of evidence rests on the assumption that each hypothesis corresponds to a specific sequence of actions and events that leave empirical traces. When the connection between a piece of evidence and a hypothesis is unique, certain, or both, the actual retrieval of certain traces in a case contributes to ranking hypotheses by their relative likelihood and eventually licenses the ascription of the case to the hypothesis with the best standing.
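The updating logic can be sketched with Bayes' rule. The example assumes, for simplicity, two mutually exclusive and exhaustive hypotheses and pieces of evidence that are independent given each hypothesis; the function name `posterior` and all probabilities are hypothetical values chosen for illustration.

```python
def posterior(prior, p_e_given_h, p_e_given_rival):
    """Probability of hypothesis H after observing evidence e (Bayes' rule)."""
    numerator = prior * p_e_given_h
    return numerator / (numerator + (1.0 - prior) * p_e_given_rival)


belief = 0.5  # no initial preference between H and its rival (assumed)

# A trace four times more likely under H than under the rival raises the belief.
belief = posterior(belief, p_e_given_h=0.8, p_e_given_rival=0.2)
print(round(belief, 3))

# A trace equally likely under both hypotheses leaves the belief unchanged.
belief = posterior(belief, p_e_given_h=0.6, p_e_given_rival=0.6)
print(round(belief, 3))
```

Evidence discriminates between hypotheses only to the extent that its likelihood differs across them, which is why traces tied uniquely to one hypothesis carry the most weight.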

Last but not least, agent-based models make it possible to test hypotheses about causal processes as emergent phenomena in silico. As Squazzoni and Bianchi illustrate in Chap. 9, the technique relies on simulation to verify whether a certain alignment of assumptions about actors and their constraints, when translated into conditional rules of individual behavior and recursively played, returns performance values close to the empirical responses of actual systems. The strategy requires regularity and counterfactual assumptions about the options available to each agent, rendered as alternative states, and about the consequence of choosing a state conditional on the states of the relevant neighbors. These models shed light on the tenability of different understandings of the mechanism that alternative policy constraints or endowments activate in the field.
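A deliberately small threshold model can illustrate how such conditional rules are played recursively. It is a sketch under assumptions introduced here (agents on a ring, two neighbors each, irreversible adoption, a single initial adopter, and the function name `simulate`), not the model presented in Chap. 9.

```python
def simulate(n=20, threshold=0.5, seeds=(0,), steps=50):
    """Return the final share of adopters among n agents placed on a ring."""
    state = [1 if i in seeds else 0 for i in range(n)]
    for _ in range(steps):
        new_state = state[:]
        for i in range(n):
            # Each agent looks at its left and right neighbors only.
            neighbors = (state[i - 1], state[(i + 1) % n])
            # Conditional rule: adopt if enough neighbors have adopted.
            if sum(neighbors) / 2 >= threshold:
                new_state[i] = 1
        if new_state == state:  # no agent changed: the system has settled
            break
        state = new_state
    return sum(state) / n


print(simulate(threshold=0.5))  # one adopting neighbor is enough
print(simulate(threshold=1.0))  # both neighbors must have adopted
```

With the lower threshold, adoption spreads from the single initial adopter to the whole ring; with the higher one, it never leaves the starting agent. Comparing such aggregate patterns with empirical ones is how alternative assumptions about individual behavior can be assessed.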

#### *1.3.4 Considerations and Extensions*

The order of the chapters, as Beach and Siewert reason in their Chap. 10, chimes with the common prescription in mixed method research that a better causal knowledge follows from a succession of techniques zooming into individual cases, where causation unfolds as actual processes and explanations can find their ultimate validation. However, they consider that the downward path of mixed methods lays knowledge open to heterogeneity threats. The actual heterogeneity is always equal to the number of instances under analysis; cross-case knowledge, however, requires that we dismiss some heterogeneity as irrelevant to afford comparisons and causal inferences. The move to local contexts implies a twofold shift—from a low to a high number of factors in the analysis and from coarse types to fine-grained tokens of evidence—that seldom supports cross-case findings. Hence, they contend that a more fruitful, if less conventional, strategy follows the upward path from local processes over structures to the causal capacity of single triggers. This path allows more conscious decisions about heterogeneity that can improve models and gauges.

In Chap. 11, Damonte and Negri conclude the journey. The chapter recognizes the fragmented image of causation that the previous contributions convey and asks whether such fragmentation is an undesirable state of affairs, as claimed by a long-honored narrative from the history of science, or an eventually valuable situation, as argued in the pluralist quarters of the philosophy of science. The point of contention concerns the inability to yield dovetailing knowledge that would affect strategies built on alternative tenets. The chapter revises these tenets and contends that, whereas ontology offers complementary angles of attack to the causal elephant and epistemology licenses interpretations that can estrange research communities from one another, methodological reasoning about models and designs reconciles the analyses when it emphasizes that causation corresponds to a few recognized shapes. These shapes, the chapter concludes, offer a rough yet common map of the elephant that strategies of any stripe can detail and enrich while pursuing their special research interests—thus contributing to better policy knowledge.

**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 2 Causation in the Social Realm**

**Daniel Little**

**Abstract** Explanation is at the center of scientific research, and explanation almost always involves the discovery of causal relations among factors, conditions, or events. This is true in the social sciences no less than in the natural sciences. But social causes look quite a bit different from causes of natural phenomena. They result from the choices and actions of numerous individuals rather than fixed natural laws, and the causal pathways that link antecedents to consequents are less exact than those linking gas leaks to explosions. It is, therefore, a crucial challenge for the philosophy of social science to give a compelling account of causal reasoning about social phenomena that does justice to the research problems faced by social scientists.

#### **Learning Objectives**

By studying this chapter, you will:


#### **2.1 Why Discuss the Ontology of Causation?**

Ontology precedes methodology. We cannot design good methodologies for scientific research without having reasonably well-developed ideas about the nature of the phenomena that we intend to investigate (Little, 2020). This point is especially important in approaching the idea of social causation. Only when we have a reasonably clear understanding of the logic and implications of the scientific idea of causality can we design appropriate methods of inquiry for searching out causal relations. And only then can we give a philosophically adequate justification of existing methods—that is, an account of how the research method in question corresponds to a sophisticated understanding of the nature of the social world.

D. Little (\*)

University of Michigan-Dearborn, Dearborn, MI, USA e-mail: delittle@umich.edu

<sup>©</sup> The Author(s) 2023

A. Damonte, F. Negri (eds.), *Causality in Policy Studies*, Texts in Quantitative Political Analysis, https://doi.org/10.1007/978-3-031-12982-7\_2

Here I will work within the framework of an "actor-centered" view of social ontology (Little, 2006, 2014, 2016). On this view, the social realm is constituted by individual actors who themselves have been cultivated and developed within ongoing social relations and who conduct their lives and actions according to their understandings and purposes. Social structures, social institutions, organizations, normative systems, cultures, and technical practices all derive their characteristics and causal powers from the socially constituted and situated individuals who make them up (Little, 2006).

This fact about social entities and processes suggests a high degree of contingency in the social world. Unlike chemistry, the social world is not a system of law-governed processes; it is instead a mix of different sorts of institutions, forms of human behavior, natural and environmental constraints, and contingent events. The entities that make up the social world at a given time and place have no essential ontological stability; they do not fall into "natural kinds"; and there is no reason to expect deep similarity across a number of ostensibly similar institutions—states, for example, or labor unions. The "things" that we find in the social world are heterogeneous and contingent. And the metaphysics associated with classical thinking about the natural world—laws of nature; common, unchanging structures; and fully predictable processes of change—do not provide appropriate building blocks for our understandings and expectations of the social world nor do they suggest the right kinds of social science theories and constructs.

Instead of naturalism, this actor-centered approach to social ontology leads to an approach to social science theorizing that emphasizes agency, contingency, and plasticity in the makeup of social facts. It recognizes that there is a degree of pattern in social life, but emphasizes that these patterns fall far short of the regularities associated with laws of nature. It emphasizes contingency of social processes and outcomes. It insists upon the importance and legitimacy of eclectic use of multiple social theories: social processes and entities are heterogeneous, and therefore, it is appropriate to appeal to different types of social theories as we explain various parts of the social world. It emphasizes the importance of path dependence in social outcomes.

#### **Box 2.1 Definitions**

**Agency**: The fact that social change and causation derive from the purposive actions of individual social actors.

**Contingency**: Social outcomes depend upon conjunctions of occurrences that need not have taken place, so the outcome itself need not have taken place. Closely related to "path dependency."

**Path dependency**: The feature of social processes according to which minor and underdetermined events in an early stage of a process make later changes more probable. For example, the QWERTY arrangement of the typewriter keyboard was selected in order to prevent typists from jamming the mechanism by typing too rapidly. Fifty years later, after widespread adoption, it proved impossible to adopt a more efficient arrangement of the keys to permit more rapid typing.

**Plasticity**: A feature of an entity or group of entities according to which the properties of the entity can change over time. Biological species demonstrate plasticity through evolution, and social entities demonstrate plasticity through the piecemeal changes introduced into them by a variety of actors and participants.

How does this ontological perspective fit with current work in policy studies? There are several current fields of social research that illustrate this approach particularly well. One is the field of the "new institutionalism." Researchers in this tradition examine the specific rules and incentives that constitute a given institutional setting. They examine the patterns of behavior that these rules and incentives give rise to in the participants in the institution, and they consider as well the opportunities and incentives that exist for various powerful actors to either maintain the existing institutional arrangements or modify them. Kathleen Thelen's (2004) study of different institutions of skill formation in Germany, Great Britain, the United States, and Japan is a case in point. This approach postulates the causal reality of institutions and the specific ensembles of rules, incentives, and practices that make them up; it emphasizes that differences across institutions lead to substantial differences in behavior; and it provides a basis for explanations of various social outcomes. The rules of liability governing the predations of cattle in East Africa or Shasta County, California, create very different patterns of behavior in cattle owners and other landowners in the various settings (Ellickson, 1991). It is characteristic of the new institutionalism that researchers in this tradition generally avoid reifying large social institutions and look instead at the more proximate and variable sets of rules, incentives, and practices within which people live and act.

#### **2.2 Scientific Realism About the Social World and Social Causation**

We are best prepared for the task of discovering causal relationships in the social world when we adopt a realist approach to the social world and to social causation. We provide an explanation of an event or pattern when we succeed in identifying the real causal conditions and events that brought it about. The central tenet of causal realism is a thesis about causal mechanisms and causal powers. Causal realism holds that we can only assert that there is a causal relationship between X and Y if we can offer a credible hypothesis of the sort of underlying mechanism that connects X to the occurrence of Y. The sociologist Mats Ekström puts the view this way: "the essence of causal analysis is … the elucidation of the processes that generate the objects, events, and actions we seek to explain" (Ekström, 1992: 115). Authors who have urged the centrality of causal mechanisms for explanatory purposes include Roy Bhaskar (1975), Nancy Cartwright (1989), Jon Elster (1989), Rom Harré and Madden (1975), Wesley Salmon (1984), and Peter Hedström (2005).

Scientific realism about social causes comes down to several simple ideas.

First, there is such a thing as social causation. Causal realism is a defensible position when it comes to the social world: there are real causal relations among social factors (structures, institutions, groups, norms, and salient social characteristics like race or gender). We can give a rigorous interpretation to claims like "racial discrimination causes health disparities in the United States" or "rail networks cause changes in patterns of habitation."

Second, causal relations among factors or events depend on the existence of real social-causal mechanisms linking cause to effect. Discovery of correlations among factors does not constitute the whole meaning of a causal statement. Rather, it is necessary to have a hypothesis about the mechanisms and processes that give rise to the correlation. Hypotheses about the causal mechanisms that exist among factors of interest permit the researcher to exclude spurious correlation (cases where variations in both factors are the result of some third factor) and to establish the direction of causal influence (cases where it is unclear whether the correlation between A and B results from A causing B or B causing A). So mechanisms are more fundamental than regularities.

Third, the discovery of social mechanisms in policy studies often requires the formulation of mid-level theories and models of these mechanisms and processes (for example, the theory of free-riders). An urban policy researcher may observe, for instance, that racially mixed high-poverty neighborhoods have higher levels of racial health disparities than racially mixed low-poverty neighborhoods. This is an observation of correlation. Researchers like Robert Sampson (2010) would like to know how "neighborhood effects" work in transmitting racial health disparities. What are the mechanisms by which a neighborhood influences the health status of an individual household? In order to attempt to answer this question, Sampson turns to mid-level hypotheses in urban sociology that contribute to a theory of the mechanisms involved in this apparent causal relationship. By mid-level theory, I mean essentially the same thing that Robert Merton (1963) conveyed when he introduced the term: an account of the real social processes that take place above the level of isolated individual action but below the level of full theories of whole social systems. Marx's theory of capitalism illustrates the latter; Jevons's theory of the individual consumer as a utility maximizer illustrates the former. Coase's theory of transaction costs (Coase, 1988) is a good example of a mid-level theory: general enough to apply across a wide range of institutional settings, but modest enough in its claim of comprehensiveness to admit of careful empirical investigation. Significantly, the theory of transaction costs has spawned major new developments in the new institutionalism in sociology (Brinton & Nee, 1998).

And finally, it is important to recognize and welcome the variety of forms of social scientific reasoning that can be utilized to discover and validate the existence of causal relations in the social world. Properly understood, there is no contradiction between the effort to use quantitative tools to chart the empirical outlines of a complex social reality, and the use of theory, comparison, case studies, process tracing, and other research approaches aimed at uncovering the salient social mechanisms that hold this empirical reality together.

#### *2.2.1 Critical Realism*

Critical realism is a specific tradition within late-twentieth-century analytic philosophy that derives from the work of Rom Harré and Roy Bhaskar (Harré & Madden, 1975; Bhaskar, 1975; Archer et al., 2016). In brief, the view holds that the ontological stance of realism is required for a coherent conception of scientific knowledge itself. Unqualified skepticism about "unobservable entities" makes scientific research and experimentation philosophically incoherent. We are forced to take the view that the entities postulated by our best theories of the world are "real"—whether electrons, viruses, or social structures. For Bhaskar, this ontological premise has much the status of Kant's transcendental arguments for causation and space and time: we cannot make sense of experience without postulating causation and locations in space and time (Bhaskar, 1975).

Concretely in the social sciences, this is taken to mean that we can be confident in asserting that social entities exist if these concepts play genuine roles in well-developed and empirically supported theories of the social world: for example, organizations, markets, institutions, social classes, normative systems, rules, ideologies, and social networks. Further, we can be confident in attributing causal powers and effects to the various social entities that we have identified—always to be supported by empirical evidence of various kinds.

#### **2.3 What Is Causation?**

Let us turn now to a more specific analysis of causation. What do we mean by a cause of something? Generally speaking, a cause is a circumstance that serves to bring about (or renders more probable) its effect, in a given environment of background conditions. Causes *produce* their effects (in appropriate background conditions). A current fruitful approach is to understand causal linkages in terms of the specific *causal mechanisms* that link cause to effect.

We can provide a preliminary definition of causation along these lines:


That is, A is necessary and sufficient in conditions Ci for the production of B. This definition can be understood in either a deterministic version or a probabilistic version. The deterministic version asserts that A in the presence of Ci always brings about B; the probabilistic version asserts that the occurrence of A in the presence of Ci increases the likelihood of the occurrence of B.

There is a fundamental choice to be made when we consider the topic of causation. Are causes real, or are causal statements just summaries of experimental and observational results and the statistical findings that can be generated using these sets of data? The first approach is the position described above as causal realism, while the second can be called causal instrumentalism. If we choose causal realism, we are endorsing the idea that there is such a thing as a *real* causal linkage between A and B; that A has the power to produce B; and that there is such a thing as causal necessity. If we choose causal instrumentalism, we are agnostic about the underlying realities of the situation, and we restrict our claims to observable patterns and regularities. The philosopher David Hume (2007) endorsed the second view, whereas many philosophers of science since the 1970s have endorsed the former view.

Most of the contributors to the current volume engage with the premises of causal realism. They believe that social causation is real; there are real causal relations among social factors (structures, institutions, groups, norms, and salient social characteristics like race or gender), and there are real underlying causal mechanisms and powers that constitute those causal relations. According to scientific realists, a key task of science is to discover the causal mechanisms and powers that underlie the observable phenomena that we study.

Causal realists acknowledge a key intellectual obligation that goes along with postulating real social mechanisms: to provide an account of the ontological *substrate* within which these mechanisms operate. In the social realm, the substrate is the system of social actors whose mental frameworks, actions, and relationships constitute the social world. This is what is meant by an "actor-centered" ontology of the social world. On this view, every social mechanism derives from facts about individual actors, the institutional context, the features of the social construction and development of individuals, and the factors governing purposive agency in specific sorts of settings. Different research programs in the social sciences target different aspects of this nexus.

This view of the underlying reality of social causation justifies a conception of causal necessity in the social realm. Do causes make their effects "necessary" in any useful sense? This is the claim that Hume rejected—the notion that there is any "necessary" connection between cause and effect. By contrast, the notion of *natural necessity* is sometimes invoked to capture this idea:

• A causes B: given the natural properties of A and given the laws of nature and given the antecedent conditions, B necessarily occurs.

This can be paraphrased as follows:

• Given A, B occurs as a result of natural necessity.

So the sense of necessity of the occurrence of the effect in this case is this: given A and given the natural properties and powers of the entities involved, B had to occur. Or in terms of possible worlds and counterfactuals (Lewis, 1973), we can say:

• In any possible world in which the laws of nature obtain, when A occurs, B invariably occurs as well.

Applied to social causation within the context of an ontology of actor-centered social facts, here is what causal necessity looks like:

• Given the beliefs, intentions, values, and goals of various participants and given the constraints, opportunities, and incentives created by the social context, whenever A occurs, the outcome B necessarily occurs [financial crisis, ethnic violence, rapid spread of infectious disease …].

This conception aligns with Wesley Salmon's idea of the "causal structure of the world," applied to the social world (1984). And this in turn indicates why causal mechanisms are such an important contribution to the analysis of causation. A causal mechanism is a constituent of this "stream of events" leading from A to B.

Probabilistic causal relations involve replacing exceptionless connections among events with probabilistic connections among events. A has a probabilistic causal relationship to B just in case the occurrence of A increases (or decreases) the likelihood of the occurrence of B. This is the substance of Wesley Salmon's (1984) criterion of causal relevance. Here is Salmon's idea of causal relevance:

• A is causally relevant to B *if and only if* the conditional probability of B given A is different from the absolute probability of B (Salmon, 1984, adapted notation).

For a causal realist, the definition is extended by a hypothesis about an underlying causal mechanism. For example, smoking is causally relevant to the occurrence of lung cancer [working through physiological mechanisms X, Y, Z]. And cell physiologists are expected to provide the mechanisms that connect exposure to tobacco smoke to increased risk of malignant cell reproduction.
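Salmon's criterion compares two quantities, P(B | A) and P(B), and both can be read off a simple cross-tabulation. The following sketch does this for a 2×2 table of counts; the function name and the counts are hypothetical and serve only to show the arithmetic.

```python
def causal_relevance(n_a_b, n_a_notb, n_nota_b, n_nota_notb):
    """Return (P(B | A), P(B)) from a 2x2 table of counts.

    n_a_b:       cases where A and B both occur
    n_a_notb:    cases where A occurs and B does not
    n_nota_b:    cases where A does not occur and B does
    n_nota_notb: cases where neither occurs
    """
    total = n_a_b + n_a_notb + n_nota_b + n_nota_notb
    p_b = (n_a_b + n_nota_b) / total
    p_b_given_a = n_a_b / (n_a_b + n_a_notb)
    return p_b_given_a, p_b


# Hypothetical counts: 200 exposed cases (30 with the outcome) and
# 800 unexposed cases (10 with the outcome).
p_b_given_a, p_b = causal_relevance(30, 170, 10, 790)
print(p_b_given_a, p_b)
```

With these counts the conditional probability differs from the absolute probability, so A satisfies the relevance criterion. For the causal realist this is only a starting point, since the difference still has to be backed by a hypothesis about the mechanism that produces it.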

It is important to emphasize that we can be causal realists about probabilistic causes just as we can about deterministic causes. A causal power or capacity is expressed as a tendency to produce an outcome; but this tendency generally requires facilitating conditions in order to be operative. The causal power is appropriately regarded as being real, whether or not it is ever stimulated by appropriate events and circumstances. A given cube of sugar is soluble, whether or not it is ever immersed in water at room temperature.

These definitions have logical implications that suggest different avenues of research and inquiry in the social sciences. First, both the deterministic and the probabilistic versions imply the truth of a *counterfactual* statement: If A had not occurred in these circumstances, B would not have occurred. (Or if A had not occurred in these circumstances, the probability of B would not have increased.) The counterfactual associated with a causal assertion suggests an experimental approach to causal inquiry. We can arrange a set of circumstances involving Ci and remove the occurrence of A and then observe whether B occurs (or observe the conditional probability of the occurrence of B).

Another important implication of a causal assertion is the idea of a set of necessary and sufficient conditions for the occurrence of E, the circumstance of explanatory interest. With deterministic causation, the assertion of a causal relationship between A and B implies that A is sufficient for the occurrence of B (in the presence of Ci) and often the assertion implies that A is a necessary condition as well. (If A had not occurred, then B would not have occurred.) On these assumptions, a valid research strategy involves identifying an appropriate set of cases in which A, Ci, and B occur, and then observing whether the appropriate covariances occur or not. J. L. Mackie (1974) provided a more detailed analysis of the logic of necessary and sufficient conditions in complex conjunctural causation with his concept of an INUS condition: "*insufficient* but *non-redundant* part of an *unnecessary* but *sufficient* condition" (62). Significantly, Mackie's formulation provides a basis for a Boolean approach to discovering causal relations among multiple factors.

These definitions and logical implications give scope to a number of different strategies for investigating causal relationships among various conditions. For probabilistic causal relationships, we can evaluate various sets of conditional probabilities corresponding to the presence or absence of conditions of interest. For deterministic causal relationships, we can exploit the features of necessary and sufficient conditions by designing a "truth table" or Boolean test of the co-occurrence of various conditions (Ragin, 1987). This is the logic of Mill's methods of agreement and difference (Mill, 1988; Little, 1995). For both deterministic and probabilistic causal relationships, we can attempt to discover and trace the workings of the causal mechanisms that link the occurrence of A to the occurrence of B.
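The truth-table logic and Mackie's INUS condition can be made concrete with a small Boolean sketch. The causal structure used here, in which B occurs when (A and C) or D, is a hypothetical example, and the helper names are illustrative; the sketch is not Ragin's procedure.

```python
from itertools import product


def outcome(a, c, d):
    """Hypothetical causal structure: B occurs when (A and C) or D."""
    return (a and c) or d


# Full truth table over the three conditions, with the outcome in the last column.
rows = [(a, c, d, outcome(a, c, d)) for a, c, d in product([False, True], repeat=3)]


def sufficient(condition):
    """True if B occurs in every row where the condition holds."""
    return all(b for a, c, d, b in rows if condition(a, c, d))


def necessary(condition):
    """True if the condition holds in every row where B occurs."""
    return all(condition(a, c, d) for a, c, d, b in rows if b)


inus = (
    not sufficient(lambda a, c, d: a)           # A alone is insufficient
    and sufficient(lambda a, c, d: a and c)     # the conjunction A and C is sufficient
    and not sufficient(lambda a, c, d: c)       # without A the conjunction fails: A is non-redundant
    and not necessary(lambda a, c, d: a and c)  # the conjunction is unnecessary, since D also yields B
)
print(inus)
```

In applied work the rows would come from observed cases rather than from a known formula, so some combinations of conditions may simply be unobserved; the sketch sidesteps that problem by enumerating every combination.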

#### *2.3.1 Causal Mechanisms*

As noted above, the central tenet of causal realism is a thesis about the real existence of causal mechanisms and causal powers. The fundamental causal concept is that of a mechanism through which A brings about or produces B (Little, 2011). According to this approach, we can only assert that there is a causal relationship between A and B if we can offer a credible hypothesis of the sort of underlying mechanism that connects A to the occurrence of B. This is central to our understanding of causation from single-case studies to large statistical studies suggesting causal relationships between two or more variables. Peter Hedström and other exponents of analytical sociology are recent voices for this approach for the social sciences (Hedström, 2005; Hedström & Ylikoski, 2010). An important paper by Machamer et al. (2000) sets the terms of current technical discussions of causal mechanisms, and James Mahoney (2001) surveyed the various theories of causal mechanisms and called for greater specificity.

What is a causal mechanism? Consider this formulation: a causal mechanism is a sequence of events, conditions, and processes leading from the explanans to the explanandum (Little, 1991: 15, 2016: 190–192). A causal relation exists between A and B if and only if there is a set of causal mechanisms that lead from A to B. This is an ontological premise, asserting that causal mechanisms are real and are the legitimate object of scientific investigation.

The theory has received substantial development in the biological sciences. Glennan et al. (2021) put the mechanisms theory in the form of six brief theses:


This definition is developed for explanations in biology, but it works well with typical examples of social mechanisms.

The idea that there are real mechanisms embodied in a given domain of phenomena provides a way of presenting causal relations that serves as a powerful alternative to the pure regularity view associated with Hume and purely quantitative approaches to causation. Significantly, this is the thrust of Judea Pearl's development of structural equation modeling (discussed below): in order to get a basis for causal inference out of a statistical analysis of a large dataset, it is necessary to provide a theory of the causal mechanisms and relations that are at work in this domain (Pearl, 2021).

Mechanisms bring about specific effects. For example, "over-grazing of the commons" is a mechanism of resource depletion. Whenever the conditions of the mechanism are satisfied, the result ensues. Moreover, we can reconstruct why this would be true for purposive actors in the presence of a public good (Hardin, 1968). Or consider another example from the social sciences: "the mechanism of stereotype threat causes poor performance on standardized tests by specific groups" (Steele, 2011). This mechanism is a hypothesized process within the cognitive–emotional system of the subjects of the test, leading from exposure to the stereotype threat through a specified cognitive–emotional mechanism to impaired performance on the test. So we can properly understand a claim for social causation along these lines: "C causes E" rests upon the hypothesis that "there is a set of causal mechanisms that convey circumstances including C to circumstances including E." In the social realm, we can be more specific. "C causes E" implies the belief that "there is a set of opportunities, incentives, rules, and norms in virtue of which actors in the presence of C bring about E through their actions."

Are there any social mechanisms? There are many examples from every area of social research. For example: "Collective action problems often cause strikes to fail." "Increasing demand for a good causes prices to rise for the good in a competitive market." "Transportation systems cause shifts of social activity and habitation." "Recognition of mutual interdependence leads to medium-term social cooperation in rural settings." In each case, we have a causal claim that depends on a hypothesis about an underlying behavioral, cognitive, or institutional mechanism producing a pattern of collective behavior.

The discovery of social mechanisms often requires the formulation of mid-level theories and models of these mechanisms and processes—for example, the theory of free-riders or the theory of grievance escalation in contentious politics. Mid-level theories in the social sciences can be viewed as discrete components of a toolbox for explanation. Discoveries about specific features of the workings of institutions, individual-collective paradoxes, failures of individual rationality like those studied in behavioral economics—all of these mid-level theories of social mechanisms can be incorporated into an account of the workings of specific social ensembles. The response of a university to a sudden global pandemic may be seen as an aggregation of a handful of well-known institutional dysfunctions, behavioral patterns, and cognitive shortcomings on the part of the various actors.

Aage Sørensen summarizes a causal realist position for the social and policy sciences in these terms: "Sociological ideas are best reintroduced into quantitative sociological research by focusing on specifying the mechanisms by which change is brought about in social processes" (Sørensen, 1998: 264). Sørensen argues that social explanation requires better integration of theory and evidence. Central to an adequate explanatory theory, however, is the specification of the mechanisms that are hypothesized to underlie a given set of observations. "Developing theoretical ideas about social processes is to specify some concept of what brings about a certain outcome—a change in political regimes, a new job, an increase in corporate performance, … The development of the conceptualization of change amounts to proposing a mechanism for a social process" (Sørensen, 1998: 239–240). If an educational policy researcher finds that there is an empirical correlation between schools that have high turnover of teaching staff and high dropout rates, it is very important to investigate whether there is a mechanism that leads from teacher turnover to student dropout. Otherwise, both characteristics may be the joint result of a third factor (inadequate school funding, for example). Sørensen makes the critical point that one cannot select a statistical model for analysis of a set of data without first asking the question, "What in the nature of the mechanisms do we wish to postulate to link the influences of some variables with others?" Rather, it is necessary to have a hypothesis of the mechanisms that link the variables before we can arrive at a justified estimate of the relative importance of the causal variables in bringing about the outcome.
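The school example can be simulated to show how a third factor produces a correlation without any mechanism linking the two observed variables. Everything in this sketch is an assumption made for illustration: the probabilities are invented, and the data-generating process deliberately gives teacher turnover no effect on dropout.

```python
import random

rng = random.Random(0)  # fixed seed so that the run is reproducible

schools = []
for _ in range(20000):
    underfunded = rng.random() < 0.5
    # Hypothetical process: funding alone sets the chance of both problems.
    # Turnover and dropout are drawn independently, given funding.
    p = 0.7 if underfunded else 0.2
    high_turnover = rng.random() < p
    high_dropout = rng.random() < p
    schools.append((underfunded, high_turnover, high_dropout))


def dropout_gap(records):
    """P(high dropout | high turnover) minus P(high dropout | low turnover)."""
    with_turnover = [d for _, t, d in records if t]
    without_turnover = [d for _, t, d in records if not t]
    return (sum(with_turnover) / len(with_turnover)
            - sum(without_turnover) / len(without_turnover))


overall_gap = dropout_gap(schools)
gap_underfunded = dropout_gap([s for s in schools if s[0]])
gap_funded = dropout_gap([s for s in schools if not s[0]])
print(overall_gap, gap_underfunded, gap_funded)
```

Across all schools, high turnover is associated with a clearly higher dropout rate, but within each funding group the gap is close to zero. Stratifying in this way presupposes that the researcher already suspects funding as the relevant third factor, which is exactly the kind of hypothesis about mechanisms that Sørensen asks for.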

Emphasis on causal mechanisms for adequate social explanation has several benefits for policy research. Policy research is always concerned about causation: what interventions can be made that would bring about different outcomes? When policy researchers look carefully for the social mechanisms that underlie the processes that they study, they are in a much better position to diagnose the reasons for poor outcomes and to recommend interventions that will bring about better outcomes. Emphasis on the need for analysis of underlying causal mechanisms takes us away from uncritical reliance on statistical models.

#### *2.3.2 Causal Powers*

Some philosophers of science have argued that substantive theories of causal powers and properties are crucial to scientific explanation. Leading exponents of this view include Rom Harré (Harré & Madden 1975), Nancy Cartwright (1989), and Stephen Mumford (2009). Nancy Cartwright places real causal powers and capacities at the center of her account of scientific knowledge (1989). As she and John Dupré put the point, "things and events have causal capacities: in virtue of the properties they possess, they have the power to bring about other events or states" (Dupré & Cartwright, 1988). Cartwright argues, for the natural sciences, that the concept of a real causal connection among a set of events is more fundamental than the concept of a law of nature. And most fundamentally, she argues that identifying causal relations requires substantive theories of the causal powers ("capacities", in her language) that govern the entities in question. Causal relations cannot be directly inferred from facts about association among variables. As she puts the point, "No reduction of generic causation to regularities is possible" (1989: 90). The importance of this idea for sociological research is profound; it confirms the notion shared by many researchers that attribution of social causation depends inherently on the formulation of good, middle-level theories about the real causal properties of various social forces and entities.

Cartwright's philosophy of causation points to the idea of a causal power—a set of propensities associated with a given entity that actively bring about the effect. The causal powers theory rests on the claim that causation is conveyed from cause to effect through the active *powers and capacities* that inhere in the entities making up the cause.

The idea of an ontology of causal powers is that certain kinds of things (metals, gases, military bureaucracies) have internal characteristics that lead them to interact causally with the world in specific and knowable ways. This means that we can sometimes identify dispositional properties that attach to kinds of things. Metals conduct electricity; gases expand when heated; military bureaucracies centralize command functions (Harré & Madden, 1975). Stephen Mumford and Rani Lill Anjum explore the philosophical implications of a powers theory of causation (2011).

The language of causal powers allows us to incorporate a number of typical causal assertions in the social sciences: "Organizations of type X produce lower rates of industrial accidents"; "paramilitary organizations promote fascist mobilization"; "tenure systems in research universities promote higher levels of faculty research productivity." In each case, we are asserting that a certain kind of social organization possesses, in light of the specifics of its rules and functioning, a disposition to stimulate certain kinds of participant behavior and certain kinds of aggregate outcomes. This is to attribute a specific causal power to species of organizations and institutions.

Sociologist James Coleman offered the view that we should distinguish carefully between macro-level social factors and micro-level individual action (Coleman, 1990). He held that all social causation proceeded through three distinct paths: social factors that influence individual behavior, individuals who interact with each other and create new social facts, and the creation of new macro-level social factors that are the aggregate result of individual actions and interactions at the micro-level. Coleman did not believe that there were direct causal influences from one macro-level social fact to another macro-level social fact. Coleman offered a diagram of this view, which came to be known as "Coleman's boat" (Fig. 2.1). On this view, when we say that a certain social entity, structure, or institution has a certain power or capacity, we mean something reasonably specific: given its configuration, it creates an environment in which individuals commonly perform a certain kind of action. This is the downward strut in the Coleman's boat diagram, labeled 1 in Fig. 2.1. This approach has two important consequences. First, social powers are not "irreducible"—rather, we can explain how they work by analyzing the specific environment of formation and choice they create. And second, they cannot be regarded as deriving from the "essential" properties of the entity. Change the institution even slightly and we may find that it has very different causal powers and capacities. Change the rules of liability for open-range grazing and you get different patterns of behavior by ranchers and farmers (Ellickson, 1991).

#### *2.3.3 Manipulability and Invariance*

Several other aspects of the causal structure of the world have been important in recent discussions of causality in the social sciences. Jim Woodward is a leading exponent of the manipulability (or interventionist) account. He develops his views in detail in his recent book, *Making Things Happen: A Theory of Causal Explanation* (2003). The view is an intuitively plausible one: causal claims have to do with judgments about how the world would be if we altered certain circumstances. If we observe that the concentration of sulfuric acid is increasing in the atmosphere leading to acid rain in certain regions, we might consider the increasing volume of H2SO4 released by coal power plants from 1960 to 1990. And we might hypothesize that there is a causal connection between these facts. A counterfactual causal statement holds that if X (increasing emissions) had not occurred, then Y (increasing acid rain) would not have occurred. The manipulability theory adds this point: if we could remove X from the sequence, then we would alter the value of Y. And this, in turn, makes good sense of the ways in which we design controlled experiments and policy interventions.

Woodward extends this analysis to develop the idea of a relationship that is "invariant under intervention." This idea follows the notion of experimental testing of a causal hypothesis. We are interested in the belief that "X causes Y." We look for interventions that change the state of Y. If we find that the only interventions that change Y do so through their ability to change X, then the X–Y relation is said to be invariant under intervention, and X is said to cause Y (Woodward, 2003: 369–370). Woodward now applies this idea to causal mechanisms. A mechanism consists of separate components that have intervention–invariant relations to separate sets of outcomes. These components are modular: they exercise their influence independently. And, like keys on a piano, they can be separately activated with discrete results. This amounts to a precise and novel specification of the meaning of "causal mechanism": "So far I have been arguing that components of mechanisms should behave in accord with regularities that are invariant under interventions and support counterfactuals about what would happen in hypothetical experiments" (374).

A related line of thought on causal analysis is the idea of *difference-making.* This approach to causation focuses on the explanations we are looking for when we ask about the cause of some outcome. Here philosophers note that there are vastly many conditions that are causally necessary for an event but do not count as being explanatory. Lee Harvey Oswald was alive when he fired his rifle in Dallas; but this does not play an explanatory role in the assassination of Kennedy. Crudely speaking, we want to know which causal factors were *salient* and which factors made a difference in the outcome. Michael Strevens (2008) provides an innovative explication of this set of intuitions through the idea of "Kairetic" explanation, a formal way of identifying salient causal factors out of a haystack of causally involved factors in the occurrence of an event guided by generality, cohesion, and accuracy. "To this end, I formulate a recipe that extracts from any detailed description of a causal process a higher level, abstract description that specifies only difference-making properties of the process" (Strevens 2008: xiii).

#### **2.4 Pluralism About Causal Inquiry**

This volume is concerned with the problem of causal inquiry and methods for the discovery of causal relations among factors. How can social researchers identify causal relations among social events and structures? The problem of causal inference is fundamental to methodology in the social and policy sciences. A well-informed and balanced handbook of political science methodology is provided by Box-Steffensmeier et al. (2008). Here I will provide a brief discussion of several approaches to causal inference in the social sciences that follows the typology offered there. Especially relevant is Henry Brady's contribution to the volume (Brady, 2008).

In their introduction to the volume, Box-Steffensmeier, Brady, and Collier propose that there are three important kinds of questions to answer when we are investigating the idea of causal relations in the social world. First is semantic: what do we mean by statements such as "A causes B"? Second is ontological: what are the features of the world that we intend to identify when we assert a causal relationship between A and B? And third is epistemological: through what kinds of investigations and processes of inference can we establish the likelihood of a causal assertion about the relationship that exists among two or more features of the social world? The last question brings us to scientific methodology and a variety of techniques of causal inquiry and inference. However, Box-Steffensmeier, Brady, and Collier are correct in asserting the prior importance of the other two families of questions. We cannot design a methodology of inquiry without having a reasonably well-developed idea of what it is that we are searching for, and that means we must provide reasonable answers to the semantic and ontological questions about causation first. The editors also make a point that is central to the current chapter as well, in favor of a pluralism of approaches to the task of causal inquiry in the social sciences (2008: 29). There is no uniquely best approach to causal inquiry in the social and policy sciences. The editors refer explicitly to a range of approaches that can be used to investigate causation in the social world: qualitative and quantitative investigation, small-n or large-n studies, experimental data, detailed historical narratives, and other approaches.

Henry Brady (2008) provides a useful typology of several families of methods of inquiry and inference that have developed within the social sciences and that find a clear place within the semantic and ontological framework of causation that is developed in this chapter. Brady distinguishes among "neo-Humean regularity" approaches, counterfactual approaches, manipulation approaches, and mechanism approaches. And he shows how a wide range of common research methods in the social sciences fall within one or the other of these rubrics. Each of these families of approaches derives from a crucial feature of what we mean by a causal relationship: the fact that causes commonly produce their effects, giving rise to observable regularities; the fact that causes act as sufficient and necessary conditions for their effects, giving rise to the possibility of making inferences about counterfactual scenarios; the fact that causes produce or inhibit other events, giving rise to the possibility of intervening or manipulating a sequence of events; and the fact that causal relations are real and are conveyed by specific (unobservable) sequences of mechanisms leading from cause to effect, giving rise to the importance of attempting to discover the operative mechanisms.

Brady's typology suggests a variety of avenues of causal inquiry that are possible in the social sciences, given the foregoing analysis of social causes. The ideas sketched in previous sections about the ontology of social causation support multiple avenues for discovering causation. Causes produce their effects, causes work through mechanisms, causal relationships should be expected to result in strong associations among events, and causal necessity supports counterfactual reasoning. We can thus design methods of inquiry that take advantage of the various ontological characteristics of social causation.

First, the primacy of "real underlying causal mechanisms" suggests that direct research aimed at discovery of the social pathways through which a given outcome is produced by the actions of individual actors within given institutional and normative circumstances is likely to be fruitful. Theory formation about the "institutional logics" created by a given institutional setting can be supplemented by direct study of cases to attempt to identify the pathways hypothesized (Thornton et al., 2012). These insights into the ontology of causation provide encouragement for case-based methods of inquiry, including process tracing, comparative studies, and testing of middle-level social theories of mechanisms. This is a set of methodological ideas supporting causal inquiry developed in detail by George and Bennett (2005), Steinmetz (2004, 2007), and Ermakoff (2019).

Second, the logic of necessary and sufficient conditions associated with the concept of a cause implies methods of research based on experimentation and observation. If we hypothesize that X is a necessary condition for the occurrence of Y, we can design a research study that searches for cases in which Y occurs but X does not. Ragin (1987), Mill (1988), and Tarrow (2010) describe the logic of such cases. The logic of necessary and sufficient conditions also supports research designs based on experimental and quasi-experimental methods—research studies in which the researcher attempts to isolate the phenomenon of interest and observes the outcomes with and without the presence of the hypothetical causal factor. Woodward (2003) illustrates the underlying logic of the experimental approach.

John Stuart Mill's methods of similarity and difference (1988) derive from this feature of the logic of causation. If we believe that A1 & A2 are jointly sufficient to produce B, we can evaluate this hypothesis by finding a number of cases in which A1, A2, and B occur and examine whether there are any cases where A1 & A2 are present but B is absent. If there is such a case, then we can conclude that A1 & A2 are not sufficient for B. Likewise, if we believe that A3 is necessary for the occurrence of B, we can collect a number of cases and determine whether there are any instances where B occurs but A3 is absent. If so, we can conclude that A3 is not necessary for the occurrence of B.
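The two checks just described amount to a search for counterexamples, and can be sketched in a few lines. This is a minimal illustration, not a rendering of any published procedure: the function names and the four cases are hypothetical, invented only to show the logic for the conditions A1, A2, A3 and the outcome B.

```python
from typing import Dict, List

Case = Dict[str, bool]

def refutes_sufficiency(cases: List[Case], conditions: List[str], outcome: str) -> List[Case]:
    """Cases where every listed condition is present but the outcome is absent."""
    return [c for c in cases if all(c[k] for k in conditions) and not c[outcome]]

def refutes_necessity(cases: List[Case], condition: str, outcome: str) -> List[Case]:
    """Cases where the outcome occurs although the condition is absent."""
    return [c for c in cases if c[outcome] and not c[condition]]

# Hypothetical cases, invented purely for illustration
cases = [
    {"A1": True,  "A2": True,  "A3": True,  "B": True},
    {"A1": True,  "A2": True,  "A3": False, "B": True},
    {"A1": True,  "A2": False, "A3": True,  "B": False},
    {"A1": False, "A2": True,  "A3": True,  "B": False},
]

# Is "A1 & A2 jointly sufficient for B" contradicted by any case?
sufficiency_counterexamples = refutes_sufficiency(cases, ["A1", "A2"], "B")

# Is "A3 necessary for B" contradicted by any case?
necessity_counterexamples = refutes_necessity(cases, "A3", "B")

print(len(sufficiency_counterexamples))  # 0: no case has A1 & A2 without B
print(len(necessity_counterexamples))    # 1: the second case has B without A3
```

An empty list of counterexamples does not establish sufficiency or necessity; it only means that the cases collected so far are compatible with the hypothesis.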

#### *2.4.1 Case Studies and Process Tracing*

Alexander George and Andrew Bennett (2005) argue for the value of a case study method of social research. The core idea is that investigators can learn about the causation of particular events and sequences by examining the events of the case in detail and in comparison with carefully selected alternative examples. Here is how George and Bennett describe the case study method:

The method and logic of structured, focused comparison is simple and straightforward. The method is "structured" in that the researcher writes general questions that reflect the research objective and that these questions are asked of each case under study to guide and standardize data collection, thereby making systematic comparison and cumulation of the findings of the cases possible. The method is "focused" in that it deals only with certain aspects of the historical cases examined. The requirements for structure and focus apply equally to individual cases since they may later be joined by additional cases. (George & Bennett, 2005: 67)

The case study method is designed to identify causal connections within a domain of social phenomena. How is that to be accomplished? The most important tool that George and Bennett describe is the method of process tracing. "The process-tracing method attempts to identify the intervening causal process—the causal chain and causal mechanism—between an independent variable (or variables) and the outcome of the dependent variable" (206). Process tracing requires the researcher to examine linkages within the details of the case they are studying and then to assess specific hypotheses about how these links might be causally mediated.

#### *2.4.2 Quantitative Research Based on Observational Data*

Quantitative studies of large populations are supported by this theory of causation, if properly embedded within a set of hypotheses about causal relations among the data. In his presentation of the logic of "structural equation modeling" (SEM) and causal inference, Judea Pearl (2000, 2021) is entirely explicit in stating that pure statistical analysis of covariation cannot establish causal relationships. In particular, Pearl argues that a causal SEM requires:

A set A of qualitative causal assumptions, which the investigator is prepared to defend on scientific grounds, and a model MA that encodes these assumptions. (Typically, MA takes the form of a path diagram or a set of structural equations with free parameters. A typical assumption is that certain omitted factors, represented by error terms, are uncorrelated with some variables or among themselves, or that no direct effect exists between a pair of variables.) (Pearl, 2021: 71)

Aage Sørensen takes a similar view and describes the underlying methodological premise of valid quantitative causal research in these terms:

Understanding the association between observed variables is what most of us believe research is about. However, we rarely worry about the functional form of the relationship. The main reason is that we rarely worry about how we get from our ideas about how change is brought about, or the mechanisms of social processes, to empirical observation. In other words, sociologists rarely model mechanisms explicitly. In the few cases where they do model mechanisms, they are labeled mathematical sociologists, not a very large or important specialty in sociology. (Sørensen, 2009: 370)

Purely quantitative studies do not establish causation on their own; but when provided with accompanying hypotheses about the mechanisms through which the putative causal influences obtain, quantitative study can substantially increase our confidence in inferences about causal relationships among factors. Quantitative methods for research on causation advanced significantly through the development of structural equation models (SEMs) and the structural causal model methodology described by Judea Pearl and others (Pearl, 2000; Pearl, 2009, 2021). This approach explicitly endorses the notion that quantitative methods require background assumptions about causal mechanisms: "one cannot substantiate causal claims from associations alone, even at the population level—behind every causal conclusion there must lie some causal assumption that is not testable" (Pearl, 2009: 99).

#### *2.4.3 Randomized Controlled Trials and Quasi-experimental Research*

The method of randomized controlled trials (RCT) is sometimes thought to be the best possible way of establishing causation, whether in biology or medicine or social science. An experiment based on random controlled trials can be described simply. It is hypothesized that:

(H) A causes B in a population of units P.

An experiment testing H is designed by randomly selecting a set of individuals from P into Gtest (the test group) and randomly assigning a different set of individuals from P into Gcontrol (the control group). Gtest is exposed to A (the treatment) and Gcontrol is not, under carefully controlled conditions designed to ensure that the ambient conditions surrounding both groups are approximately the same. The status of each group is then measured with regard to B, and the difference in the value of B between the two groups is said to be the "average treatment effect" (ATE). If the average treatment effect is greater than zero, there is prima facie reason to accept H.
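The logic of this design can be sketched as a small simulation in which the test group receives the treatment and the control group does not. Everything numerical here is a hypothetical assumption chosen for illustration: the population size, the group size, the baseline distribution of B, and a true effect of 2.0 that is identical for every unit.

```python
import random
from statistics import mean

def simulate_rct(population_size=10_000, group_size=1_000, true_effect=2.0, seed=0):
    """Estimate the average treatment effect as a difference in group means."""
    rng = random.Random(seed)
    # Baseline value of B that each unit in P would show without treatment
    baseline = [rng.gauss(10.0, 3.0) for _ in range(population_size)]
    # Randomly draw two disjoint groups of units from P
    sampled = rng.sample(range(population_size), 2 * group_size)
    g_test, g_control = sampled[:group_size], sampled[group_size:]
    # Only the test group is exposed to A; the effect is assumed additive
    outcome_test = [baseline[i] + true_effect for i in g_test]
    outcome_control = [baseline[i] for i in g_control]
    return mean(outcome_test) - mean(outcome_control)

ate_estimate = simulate_rct()
print(round(ate_estimate, 2))
```

Even in this idealized setting the estimate differs from the assumed true effect by sampling noise, because any single random draw leaves the two groups with slightly different baselines. The sketch also assumes a homogeneous effect, which real populations need not satisfy.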

This research methodology is often thought to capture the logical core of experimentation and is sometimes thought to constitute the strongest evidence possible for establishing or refuting a causal relationship between A and B. It is thought to represent a purely empirical way of establishing causal relations among factors. This is so because of the random assignment of individuals to the two groups (so potentially causally relevant individual differences are averaged out in each group) and because of the strong efforts to isolate the administration of the test so that each group is exposed to the same unknown factors that may themselves influence the outcome to be measured. As Handley et al. (2018) put the point: "Random allocation minimizes selection bias and maximizes the likelihood that measured and unmeasured confounding variables are distributed equally, enabling any differences in outcomes between the intervention and control arms to be attributed to the intervention under study" (Handley et al., 2018: 6). The social and policy sciences are often interested in discovering and measuring the causal effects of large social conditions and interventions—"treatments", as they are often called in medicine and policy studies. It might seem plausible, then, that empirical social science should make use of random controlled trials whenever possible, in efforts to discover or validate causal connections.

However, this supposed "gold standard" status of random controlled trials has been seriously challenged in the last several years. Serious methodological and inferential criticisms have been raised of common uses of RCT experiments in the social and behavioral sciences, and philosopher of science Nancy Cartwright has played a key role in advancing these criticisms. Cartwright and Hardie (2012) provided a strong critique of common uses of RCT methodology in areas of public policy, and Cartwright and others have offered convincing arguments to show that inferences about causation based on RCT experiments are substantially more limited and conditional than generally believed.

A pivotal debate among experts in a handful of fields about RCT methodology took place in a special issue of *Social Science and Medicine* in 2018. This volume is essential reading for anyone interested in causal reasoning. Especially important is Deaton and Cartwright (2018). The essence of their critique is summed up in the abstract: "We argue that the lay public, and sometimes researchers, put too much trust in RCTs over other methods of investigation. Contrary to frequent claims in the applied literature, randomization does not equalize everything other than the treatment in the treatment and control groups, it does not automatically deliver a precise estimate of the average treatment effect (ATE), and it does not relieve us of the need to think about (observed or unobserved) covariates" (Deaton & Cartwright, 2018). Deaton and Cartwright provide an interpretation of RCT methodology that places it within a range of comparably reliable strategies of empirical and theoretical investigation, and they argue that researchers need to choose methods that are suitable to the problems that they study.

One of the key concerns they express has to do with extrapolating and generalizing from RCT studies (Deaton & Cartwright, 2018: 3). A given RCT study is carried out in a specific and limited set of cases, and the question arises whether the effects documented for the intervention in this study can be extrapolated to a broader population. Do the results of a drug study, a policy study, or a behavioral study give a basis for believing that these results will obtain in the larger population? Their general answer is that extrapolation must be done very carefully. "We strongly contest the often-expressed idea that the ATE calculated from an RCT is automatically reliable, that randomization automatically controls for unobservables, or worst of all, that the calculated ATE is true [of the whole population]" (Deaton & Cartwright, 2018: 10).

The general perspective from which Deaton and Cartwright proceed is that empirical research about causal relationships—including experimentation—requires a broad swath of knowledge about the processes, mechanisms, and causal powers at work in the given domain. Here their view converges philosophically with that offered by Pearl above. This background knowledge is needed in order to interpret the results of empirical research and to assess the degree to which the findings of a specific study can plausibly be extrapolated to other populations.

These methodological and logical concerns about the design and interpretation of experiments based on randomized controlled trials make it clear that it is crucial for social scientists to treat RCT methodology carefully and critically. Deaton and Cartwright agree that RCT experimentation is a valuable component of the toolkit of sociological investigation. But they insist that it is crucial to keep several philosophical points in mind. First, there is no "gold standard" method for research in any field; rather, it is necessary to adapt methods to the nature of the data and causal patterns in a given field. Second, Cartwright (like most philosophers of science) is insistent that empirical research, whether experimental, observational, statistical, or Millian, always requires theoretical inquiry into the underlying mechanisms that can be hypothesized to be at work in the field. Only in the context of a range of theoretical knowledge is it possible to arrive at reasonable interpretations of (and generalizations from) a set of empirical findings.

Many issues of causation in the social and policy sciences cannot be addressed in a controlled laboratory environment. In particular, in many instances, it is impossible to satisfy the condition of random assignment of individuals to control and treatment groups. Much data available for social science and policy research is gathered from government databases (Medicaid, Department of Education, Internal Revenue Service) and was assembled for statistical and descriptive purposes. Hypotheses about the causes of failing schools, ineffective prison reforms, or faulty regulatory systems are not amenable to the strict requirements of randomized controlled trials. However, social and policy scientists have developed practical methods for probing causation in complex social settings using natural experiments, field experiments, and quasi-experiments.

Quasi-experiments, field experiments, and natural experiments are sometimes defined as "randomized controlled trials carried out in a real-world setting" (Teele, 2014: 3). This definition is misleading, because the crucial feature of RCTs is absent in a quasi-experiment: the random assignment of units to control and treatment groups. What quasi-experiments have in common is an effort to replace random assignments of units to control and treatment groups with some other way of stratifying available data that would permit inference about cause and effect. Quasi-experiments involve making use of observational data about similar populations that have been exposed to different and potentially causally relevant circumstances. The researcher then attempts to discover treatment effects based on statistical properties of the two groups. In this volume, Battistin and Bertoni (Chap. 3) describe an ingenious set of constructs to uncover the effects of cheating on educational performance examination scores in Italy, based on what they refer to as "instrumental variables" and "regression discontinuity design." The former is a component of the composition of the control group that can be demonstrated to be random. The authors show how this randomness can be exploited to discover the magnitude of effects of the non-random components in the composition of the control group. The latter takes advantage of the fact that some data sets (class size in Italy, for example) are "saw-toothed" with respect to a known variable. The example they use is the government policy in Italy that regulates class size. School populations increase linearly, but government policy establishes the thresholds at which a school is required to create a new class. So class size increases from the minimum to the maximum, then declines sharply, and continues. This fact can be exploited to examine school performance in classes currently near the minimum versus classes currently near the maximum. This approach removes school population size from the selection and therefore succeeds in removing a confounding causal influence, which is exactly what randomization was intended to do.
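The "saw-toothed" pattern follows directly from a rule that caps class size. The sketch below is a simplified illustration and not the analysis reported in Chap. 3: the cap of 25 pupils is a hypothetical value rather than the actual Italian regulation, and the function name is invented.

```python
import math

def average_class_size(enrollment: int, cap: int = 25) -> float:
    """Average class size when a new class must be opened once the cap is exceeded."""
    classes = math.ceil(enrollment / cap)  # smallest number of classes respecting the cap
    return enrollment / classes

# Enrollment rises steadily, but average class size drops at each threshold
sizes = {n: average_class_size(n) for n in (24, 25, 26, 50, 51)}
for enrollment, size in sizes.items():
    print(enrollment, size)
# 24 24.0
# 25 25.0
# 26 13.0
# 50 25.0
# 51 17.0
```

Schools with 25 and 26 pupils are nearly identical in enrollment yet end up with very different class sizes, which is the kind of comparison that a regression discontinuity design exploits.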

The reasoning illustrated in Battistin and Bertoni (Chap. 3) is admirable in the authors' effort to squeeze meaningful causal inferences out of a data set that is awash with non-random elements. However, as Battistin and Bertoni plainly demonstrate, it is necessary to be rigorously critical in developing and evaluating these kinds of research designs and inferences. Stanley Lieberson's *Making It Count* (1985) formulates a series of difficult challenges for the logic of quasi-experimental design that continues to serve as a cautionary tale for quantitative social and policy research. Lieberson believes that there are almost always unrecognized forms of selection bias in the makeup of quasi-experimental research designs that potentially invalidate any possible finding. Cartwright and Hardie (2012) extend these critical points by underlining the limitations on generalizability (external validity) that are endemic to experimental reasoning. So selection bias is still a possibility that can interfere with valid causal reasoning in the design of a quasi-experiment.

What conclusions should we draw about experiments and quasi-experiments? What is the status of randomized controlled trials as a way of isolating causal relationships, whether in sociology, medicine, or public policy? The answer is clear: RCT methodology is a legitimate and important tool for sociological research, but it is not fundamentally superior to the many other methods of empirical investigation and inference in use in the social sciences. Methodologies supporting the design and interpretation of quasi-experiments are also subject to important methodological cautions in social science and policy studies. It is necessary to remain critical and reflective in assessing the assumptions that underlie any social science research design, including randomized controlled trials and sophisticated quasi-experiments.

#### *2.4.4 Generative Models and Simulation Methods*

Advances in computational power and software have made simulations of social situations substantially more realistic than in previous decades. An early advance took place in general equilibrium theory, leading to a set of models referred to as "computable general equilibrium models." Instead of using a three-sector model to illustrate the dynamics of a general equilibrium model of a market economy, it is now feasible to embody assumptions for one hundred or more industries and work out the equilibrium dynamics of this substantially more realistic representation of an economic system using a computable model (Taylor, 1990). Of special interest for political scientists and policy scholars is the increasing sophistication of agent-based models (de Marchi and Page, 2008). Kollman et al. (2003) provide a highly informative overview of the current state of the field in their *Computational Models in Political Economy*. They describe the chief characteristics of an agent-based model in these terms:

The models typically have four characteristics, or methodological primitives: agents are diverse, agents interact with each other in a decentralized manner, agents are boundedly rational and adaptive, and the resulting patterns of outcomes often do not settle into equilibria…. The purpose of using computer programs in this second role is to study the aggregate patterns that emerge from the "bottom up" (Kollman et al. 2003: 3).

An often-cited early application of agent-based models was Thomas Schelling's segregation model. Schelling demonstrated that residential segregation was likely to emerge from a landscape in which two populations had tolerant but finite requirements for the ethnic composition of their neighborhoods (Schelling, 1978). A random landscape populated with a mix of the two populations almost always develops into a segregated landscape of the populations after a number of iterations. Agent-based models can be devised to provide convincing "generative" explanations of a range of collective phenomena; and when developed empirically by calibrating the assumptions of the model to current empirical data, they can yield reasonable predictions about the near-term future of a given social phenomenon (Epstein, 2006).
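The dynamic described above can be illustrated with a minimal sketch in Python. It is not Schelling's original specification: the grid size, the share of empty cells, the 30% tolerance threshold, the wrap-around borders, and the rule that an unhappy agent moves to a randomly chosen empty cell are all assumptions made for illustration, and the function names are hypothetical.

```python
import random

def neighbors(grid, r, c):
    """Types of the occupied cells among the eight cells around (r, c); borders wrap around."""
    n = len(grid)
    found = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            v = grid[(r + dr) % n][(c + dc) % n]
            if v is not None:
                found.append(v)
    return found

def share_alike(grid, r, c):
    """Share of occupied neighbors that have the same type as the agent at (r, c)."""
    nb = neighbors(grid, r, c)
    if not nb:
        return 1.0  # an agent with no neighbors is treated as satisfied
    return sum(v == grid[r][c] for v in nb) / len(nb)

def mean_share_alike(grid):
    """Average share of like neighbors over all agents: a crude segregation index."""
    n = len(grid)
    shares = [share_alike(grid, r, c)
              for r in range(n) for c in range(n) if grid[r][c] is not None]
    return sum(shares) / len(shares)

def simulate(n=20, empty_share=0.1, threshold=0.3, sweeps=30, seed=0):
    """Return the segregation index before and after the agents relocate."""
    rng = random.Random(seed)
    n_empty = int(empty_share * n * n)
    n_agents = n * n - n_empty
    cells = [0] * (n_agents // 2) + [1] * (n_agents - n_agents // 2) + [None] * n_empty
    rng.shuffle(cells)  # random initial landscape
    grid = [cells[i * n:(i + 1) * n] for i in range(n)]
    before = mean_share_alike(grid)
    for _ in range(sweeps):
        coords = [(r, c) for r in range(n) for c in range(n)]
        rng.shuffle(coords)
        for r, c in coords:
            if grid[r][c] is None or share_alike(grid, r, c) >= threshold:
                continue  # empty cell, or agent content with its neighborhood
            empties = [(i, j) for i in range(n) for j in range(n) if grid[i][j] is None]
            i, j = rng.choice(empties)
            grid[i][j], grid[r][c] = grid[r][c], None  # unhappy agent relocates
    return before, mean_share_alike(grid)

before, after = simulate()
print(f"mean share of like neighbors: before={before:.2f}, after={after:.2f}")
```

Even though each agent in this sketch is content with a neighborhood in which only 30% of neighbors are of its own type, the average share of like neighbors typically ends up well above its starting value. This is the sense in which the aggregate pattern is generated from the "bottom up".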

We can look at ABM simulation techniques as a form of "mechanisms" theory. A given agent-based model is an attempt to work out the dynamics of individual-level actions at the meso- and macro-level; and this kind of result can be interpreted as an empirically grounded account of the mechanisms that give rise to a given kind of social phenomenon. This feature of agent-based model methodology gives researchers yet another tool through which to probe the social world for causal relations among social features.

#### **2.5 Realism and Methodological Pluralism**

Let us draw to a close. Here are some chief features of social science research that proceeds in ways consistent with this realist view of causation in the social world:


Central in these ideas is the value of *methodological pluralism*. The ultimate goal of research in the social and policy sciences is to discover causal relationships and causal mechanisms. We want to know how the social world works and how we might intervene to change outcomes that are socially undesirable. There is a wide range of methods of inquiry and validation that are used in the social sciences: ethnographic methods (interviews and participant observation), case study analysis, comparative case study research, models and simulations of social arrangements of interest, and large-scale statistical studies. The philosophical position of methodological pluralism is the idea that there is a place in social and policy research for all of these tools and more besides. What holds them together is the fact that in each case, our ultimate concern is to discover the causal relationships that appear to hold in the social world and the mechanisms that underlie these relationships.

The central conclusion to be drawn here is that multiple methods of empirical investigation are available, and our research efforts will be most productive when we are able to connect empirical findings with hypotheses about social-causal mechanisms that are both theoretically and observationally supported. And equally importantly, it is crucial for researchers from different methodological traditions to interact with each other so that their underlying assumptions about causation and causal inference can be refined and validated.

#### **Review Questions**


#### **References**

Archer, M. S., Decoteau, C., Gorski, P., Little, D., Porpora, D., Rutzou, T., Smith, C., Steinmetz, G., & Vandenberghe, F. (2016). What is critical realism? *Perspectives, 38*(2), 4–9. http://www.asatheory.org/current-newsletter-online/what-is-critical-realism

Bhaskar, R. (1975). *A realist theory of science*. Leeds Books.


#### *Suggested Readings*


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 3 Counterfactuals with Experimental and Quasi-Experimental Variation**

**Erich Battistin and Marco Bertoni**

**Abstract** Inference about the causal effects of a policy intervention requires knowledge of what would have happened to the outcome of the units affected had the policy not taken place. Since this counterfactual quantity is never observed, the empirical investigation of causal effects must deal with a missing data problem. Random variation in the assignment to the policy offers a solution, under some assumptions. We discuss identification of policy effects when participation in the policy is determined by a lottery (randomized designs), when participation is only partially influenced by a lottery (instrumental variation), and when participation depends on eligibility criteria making a subset of participant and non-participant units as good as randomly assigned to the policy (regression discontinuity designs). We offer guidelines for empirical analysis in each of these settings and provide some applications of the methods proposed to the evaluation of education policies.

#### **Learning Objectives**

By studying this chapter, you will:


E. Battistin University of Maryland, College Park, MD, USA e-mail: ebattist@umd.edu

M. Bertoni (\*) University of Padova, Padova, Italy e-mail: marco.bertoni@unipd.it

#### **3.1 Introduction**

Do smaller classes yield better school outcomes? To answer this and many similar questions, one needs to compare the outcome in the *status quo* (a large class) to the outcome that would have been observed if the input of interest was set to a different level (a small class). The comparison of students enrolled in small and large classes is always a tempting avenue to answer this causal question. As this comparison involves different students, its validity rests on the assumption that students currently enrolled in small and large classes would have presented the same outcome, on average, had they been exposed to the same number of classmates. This remains an untestable assumption that must be discussed on a case-by-case basis.

The chapter discusses ways to combine policy designs and data to corroborate the validity of this assumption. Sections 3.2 and 3.3 introduce the language of counterfactual causal analysis. They describe the concepts of treatments, potential outcomes and causal effects, and the attributes characterizing the validity of a research design. Section 3.4 is about the beauty and limitations of randomized assignment to "treatment" (e.g., a small class) and paves the way for the discussion in the following sections. Specifically, these sections deal with methods for causal reasoning when randomization is not feasible. Section 3.5 provides an example of instrumental variation in treatment assignment arising from a natural experiment. Section 3.6 is devoted to the closest cousin to randomization, the regression discontinuity design. Section 3.7 offers some concluding remarks.

Our discussion of empirical methods for causal reasoning is far from exhaustive. For example, we do not discuss research designs that exploit longitudinal data and rely on assumptions on pre-treatment outcome trends (e.g., difference-in-differences and synthetic control methods). Similarly, we do not cover matching methods (see Chap. 4 of this volume). In addition, our presentation will mostly focus on the reasoning underlying design-based identification and will only barely touch issues related to estimation. The interested reader can refer to the book by Angrist and Pischke (2008) for a discussion of these topics.

#### **3.2 Causation and Counterfactual Impact Evaluation: The Jargon**

It is useful to start by clarifying what we mean by "causes" and "treatment effects." We consider a population of units indexed by *i*, with *i* = 1, …, *N*. Although our narrative will often consider individuals as the units of analysis, the same setting extends to other statistical units such as households, villages, schools, or municipalities.

#### *3.2.1 Causes as Manipulable Treatments*

In the population we study, some units are exposed to a cause, which is a treatment or intervention that manipulates factors that may affect a certain outcome. For instance, we might be interested in studying whether class size at primary school affects student performance. Class size here is the treatment and performance is the outcome, which is typically measured using standardized tests. In many countries, class size formation depends on grade enrollment so that, across cohorts, the number of students in the class may change because enrollment changes or because a specific policy affects the regulation. We will use the words "cause", "treatment", or "intervention" interchangeably.

The avenue we take here has some limitations, as not all causes worth considering are manipulable in practice (consider, for example, gender, ethnicity, or genetic traits). Moreover, the design-based approach we describe below may be coarse at times and aimed at shedding light on one particular aspect of a more articulated model. For example, empirical evidence on the causal effects of class size on achievement bundles up the possible contribution of multiple channels that may lead to a better learning environment in small classes. The investigation of channels and mechanisms behind the uncovered effects calls for theories and structural models. The most relevant question to consider turns on the quality of the design-based strategy and on our faith to prop up a more elaborate theoretical framework.

We focus only on binary treatments, that is, we assume that treatment status is described by a binary random variable *Di* taking value one if unit *i* is exposed to treatment ("treated" or "participant") and zero otherwise ("untreated", "non-participant", or "control"). In the class size example, this amounts to considering a setting in which students can be enrolled in small or large classes. The extension to the case of multi-valued or continuous treatment (for example, the number of classmates) is logically identical but requires a more cumbersome notation. More generally, the binary case is always worth considering even in a more general context, as it helps understand the main challenges in the quest for detecting causal effects. A related issue concerns public policies that are designed as "bundles" of multiple components. In those cases, policy-makers are often interested in disentangling the effect of every component of the policy. We abstract from this problem in our discussion, but emphasize here that the ability to address this question will depend, in general, on the exposure of subjects to different components.

We must take a stand on the reasons why different units end up having a value of *Di* equal to one or zero. This is the so-called "assignment rule" and is at the core of any evaluation study. Assignment to treatment can be totally random. In our class size example, this happens when students are randomized to a small or a large class with equal probability and independently of socio-economic background or past performance. When randomization is not at work, participation in treatment is most likely the result of choices made by the units themselves, administrators of the program, or policy makers. For example, parents can choose to enroll their children in schools with smaller classes in the hope of a better learning environment. Finally, participation in treatment may depend on admission rules that units must comply with. The case of class size formation based on total enrollment is a good example, as the chance of being enrolled in a small class depends on a school's yearly total recruitment. As we shall see, our ability to assess causal effects grows with knowledge of the assignment rule.

#### *3.2.2 Effects as Differences Between Factual and Counterfactual Outcomes*

It is essential to set the stage for a transparent definition of the treatment effect. To do so, we define *Yi*(1) and *Yi*(0) as the potential outcomes experienced if unit *i* is treated (*Di* = 1) or untreated (*Di* = 0), respectively. The unit-level treatment effect of *Di* on *Yi* is the difference between *Yi*(1) and *Yi*(0): Δ*i* = *Yi*(1) − *Yi*(0). Decades of empirical studies using micro-data analyses have taught us that treatment effects most likely vary across units or groups of units with very similar demographics. The notation employed here accommodates this possibility (the manuals by Angrist & Pischke, 2008, and Imbens & Rubin, 2015, use the same approach).

The definition of Δ*i* unveils the fundamental problem that we face when we want to estimate this quantity from the data. While the two potential outcomes can be logically defined for each unit, they can never be observed simultaneously for the same unit. This is true regardless of the assignment rule and the richness or sample size of data we will ever work with. Specifically, the data can reveal only *Yi*(1) for units with *Di* = 1 and *Yi*(0) for units with *Di* = 0. We can, therefore, express the observed outcome *Yi* as follows: *Yi* = *Yi*(1)*Di* + *Yi*(0)(1 − *Di*) = *Yi*(0) + *Di*(*Yi*(1) − *Yi*(0)). As simple as this can be, lack of observability of both potential outcomes implies lack of observability of the unit-level effect Δ*i*. We can think of the unit-level causal effect as the difference between an observed (factual) and an unobserved (counterfactual) potential outcome. Factual quantities are those that can be computed from the data. Counterfactual quantities can be logically defined but can never be computed from data. For treated units, we observe *Yi* = *Yi*(1) and *Yi*(0) is the counterfactual. The opposite is true for control units, for whom we observe *Yi* = *Yi*(0) and *Yi*(1) is the counterfactual.

One way to get around this limitation is to settle for less than unit-level effects. We might be interested in considering average treatment effects for the population or only for some sub-groups. For instance, we define the average treatment effect (ATE) as the average of the individual-level treatment effect in the whole population: *ATE* = *E*(*Yi*(1) − *Yi*(0)). This parameter reflects our expectation of what would happen if we were to expose to treatment a randomly chosen unit from the population. Alternatively, we can consider the average treatment effect for the treated (ATT), which describes our expectation for units who have been exposed to treatment: *ATT* = *E*(*Yi*(1) − *Yi*(0)| *Di* = 1). Analogously, the average treatment effect for the non-treated (ATNT) is informative about what would have happened to the untreated if they had been exposed to the intervention:

$$ATNT = E\left(Y_i(1) - Y_i(0) \mid D_i = 0\right).$$

Whether any of the above causal parameters can be retrieved from the data will have to be discussed on a case-by-case basis; our understanding of the assignment rule plays a key role in this discussion.
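A small worked example may help fix ideas. The sketch below uses six hypothetical students with made-up test scores; both potential outcomes are written down for each student, which is possible only in an invented example and never in real data. It computes the observed outcome from the switching equation and then the three average effects defined above.

```python
# Each tuple is (Y(0), Y(1), D): score in a large class, score in a small
# class, and treatment status. All numbers are invented for illustration.
students = [
    (60, 66, 1),
    (70, 74, 1),
    (80, 82, 1),
    (50, 58, 0),
    (55, 61, 0),
    (65, 69, 0),
]

def average(values):
    values = list(values)
    return sum(values) / len(values)

# Observed outcome: Y = Y(1) * D + Y(0) * (1 - D).
observed = [y1 * d + y0 * (1 - d) for y0, y1, d in students]

# Average effects, computable here only because both potential outcomes are listed.
ate = average(y1 - y0 for y0, y1, d in students)
att = average(y1 - y0 for y0, y1, d in students if d == 1)
atnt = average(y1 - y0 for y0, y1, d in students if d == 0)

print(observed)        # [66, 74, 82, 50, 55, 65]
print(ate, att, atnt)  # 5.0 4.0 6.0
```

In this invented population the three parameters differ because unit-level effects vary and treated students happen to have smaller gains. The ATE is the average of the ATT and the ATNT weighted by the shares of treated and untreated units, here one half each.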

#### *3.2.3 What the Data Tell (And When)*

Our journey to learn about treatment effects begins by comparing features of the observed outcome *Yi* for treated and control units. For instance, the data reveal the average outcomes for treated units, *E*(*Yi*|*Di* = 1), and control units, *E*(*Yi*|*Di* = 0). Recalling the definition of potential outcomes, the naïve comparison of average outcomes by treatment group, *E*(*Yi*|*Di* = 1) − *E*(*Yi*| *Di* = 0) = *E*(*Yi*(1)|*Di* = 1) − *E*(*Yi*(0)| *Di* = 0), conveys the correlation between the treatment, *Di*, and the outcome, *Yi*.

The causal interpretation of such a naïve comparison is controversial in most cases. To see why, we can add and subtract from the right-hand side of the previous equation the quantity *E*(*Yi*(0)|*Di* = 1). This is a counterfactual quantity, as the outcome *Yi*(0) cannot be observed for treated units, and represents what would have happened to treated units had they not participated in treatment. We can arrange the terms and write:

$$\begin{aligned} E\left(Y_i \mid D_i = 1\right) - E\left(Y_i \mid D_i = 0\right) ={}& E\left(Y_i(1) - Y_i(0) \mid D_i = 1\right) \\ &+ E\left(Y_i(0) \mid D_i = 1\right) - E\left(Y_i(0) \mid D_i = 0\right). \end{aligned} \tag{3.1}$$

It follows that the naïve comparison on the left-hand side of Eq. 3.1 is equal to the sum of the ATT and the term *E*(*Yi*(0)| *Di* = 1) − *E*(*Yi*(0)| *Di* = 0), which is often called "selection bias". It is worth noting that this representation does not hinge on any assumptions. It is the result of a simple algebraic trick and, as such, is always true.
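The decomposition can be checked numerically on simulated data, where both potential outcomes are known by construction. The sketch below is an illustration under stated assumptions: the variable `background`, the constant effect of 3 points, and the selection rule are all hypothetical choices, not estimates from any study.

```python
import random

rng = random.Random(1)
rows = []
for _ in range(20_000):
    background = rng.gauss(0, 1)                 # hypothetical family background
    y0 = 60 + 5 * background + rng.gauss(0, 3)   # score in a large class
    y1 = y0 + 3                                  # score in a small class (assumed gain of 3)
    # Non-random assignment: students with a better background are more
    # likely to end up in a small class.
    d = 1 if background + rng.gauss(0, 1) > 0 else 0
    rows.append((y0, y1, d))

def average(values):
    values = list(values)
    return sum(values) / len(values)

# Naive comparison: uses only what would be observed in real data.
naive = (average(y1 for y0, y1, d in rows if d == 1)
         - average(y0 for y0, y1, d in rows if d == 0))

# The two terms on the right-hand side; both need the counterfactual Y(0)
# of treated units, so they are computable only in a simulation.
att = average(y1 - y0 for y0, y1, d in rows if d == 1)
selection_bias = (average(y0 for y0, y1, d in rows if d == 1)
                  - average(y0 for y0, y1, d in rows if d == 0))

print(f"naive={naive:.2f}  ATT={att:.2f}  selection bias={selection_bias:.2f}")
```

The naive difference equals the sum of the other two quantities up to floating-point error, whatever the seed. Under the assumed selection rule the bias is positive, so the naive comparison overstates the effect of a small class.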

Selection bias is an error in the causal reasoning. It is different from zero when, in the absence of treatment, the group with *Di* = 1 would have performed differently from the group with *Di* = 0. The same concept is conveyed by the "correlation is not causation" *motto*: correlation (the naïve treatment–control comparison) has no causal interpretation (that is, it does not coincide with the ATT) unless the selection bias is zero. This reframes the quest for causal effects as a discussion on the existence of selection bias. A non-zero bias follows from having groups defined by *Di* = 1 and *Di* = 0 that are not representative of the same population, in the sense that participation in treatment depends on non-random selection. At the end of the day, selection bias reflects compositional differences between treatment and control units. Taking up our class size example, parents with a strong preference for smaller classes are most likely selected in terms of socio-economic background and demographics. If this selection translates into a better learning potential of their children, forming classes as a reflection of parental preference must create dis-homogenous groups of students. In this case, detecting a correlation between class size and achievement might just reveal dis-homogeneity across classes rather than a true causal effect of class size.

Importantly, for the time being, we are agnostic about whether this dis-homogeneity concerns characteristics of units that are observed in the data at hand or not. In fact, any strategy that can adjust for compositional differences between treated and control units also corrects for this bias. One leading example to consider here is randomization. When classes are formed by a coin toss, their composition is the same in expectation. Any differences in composition that remain arise from sampling variability and must be as good as random. We will formalize this idea in Sect. 3.4, below. Instead, Chapters 4 and 5 in this volume present methods to alleviate imbalances along observable dimensions and discuss the identifying assumptions that permit causal conclusions once these differences are eliminated.

#### **3.3 Shades of Validity**

The assessment of a causal channel from treatment to the outcome depends on the properties of the research design. In short, this is the toolbox of empirical methods that allows one to distinguish between correlation and causality. Any strategy falling short on this minimum requirement is not a valid option to consider for a good researcher. On the other hand, a good research design must be able to detect precisely the causal relationship of interest. That is, you do not want your design to be underpowered for the size of the treatment effect. Finally, the ideal research design should be able to provide causal statements that apply to the largest share of units in the population and extend to other contexts and times. The concern here is one of generalizability, which is of fundamental importance for offering evidence-based policy recommendations. Causal talk makes use of these three ideas of validity in the development of a research design. This is what we will discuss briefly next. The seminal textbook by Cook and Campbell (1979) provides a deeper treatment of these topics.

#### *3.3.1 Internal Validity: The Ability to Make a Causal Claim from a Pattern Documented in the Data*

Internal validity concerns the ability to assess whether the correlation between treatment and outcome depicts a causal relationship or if it could have been observed even in the absence of the treatment. Therefore, internal validity is solely concerned with the presence of selection bias. It is achieved under a *ceteris paribus* comparison of units, when all else but the treatment is kept constant between treated and control units. As we discussed above, this calls for the same composition of treatment (small class) and control (large class) units. An internally valid conclusion is one without selection bias. One of the main advantages of using randomization is that such a *ceteris paribus* condition is met by design. Because of this, a properly conducted randomization yields internally valid causal estimates.

#### *3.3.2 Statistical Validity: Measuring Precisely the Relationship Between Causes and Outcomes in the Data*

Statistical validity refers to the appropriate use of statistical tools to assess the extent of correlation between treatment and outcomes. It is fundamentally concerned with standard errors and accuracy in assessing a statistical relationship. The main question addressed by statistical validity is whether the chosen data and techniques of statistical inference can produce precise estimates of very small treatment effects (a statistically precise zero) or if, instead, the research design will likely produce statistical zeros (a statistically insignificant effect). An insignificant effect that is statistically different from zero is a powerful oxymoron to summarize the idea underlying statistical validity.
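One common way to gauge whether a design is precise enough is to compute the smallest effect it could reliably detect. The sketch below is an illustration rather than part of the chapter's toolbox: it assumes a simple difference in means between two independent groups with a common outcome standard deviation, a two-sided test, and a normal approximation. The function name and default values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def minimum_detectable_effect(sd, n_treated, n_control, alpha=0.05, power=0.80):
    """Smallest true effect detected with probability `power` by a two-sided
    test of size `alpha`, for a difference in means (normal approximation)."""
    standard_error = sd * sqrt(1 / n_treated + 1 / n_control)
    z = NormalDist()  # standard normal distribution
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * standard_error

# With the outcome measured in standard deviation units (sd = 1):
small_study = minimum_detectable_effect(1.0, 100, 100)
large_study = minimum_detectable_effect(1.0, 10_000, 10_000)
print(f"100 units per arm:    {small_study:.3f} sd")
print(f"10,000 units per arm: {large_study:.3f} sd")
```

Under these assumptions, a study with 100 units per arm can reliably detect only effects of roughly 0.4 standard deviations, so a smaller true effect would most likely show up as statistically insignificant. With 10,000 units per arm the same calculation gives a threshold ten times smaller.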

#### *3.3.3 External Validity: The Ability to Extend Conclusions to a Larger Population, over Time and Across Contexts*

External validity is about the predictive value of a particular causal estimate for times, places, and units beyond those represented in the study that produced it. The concern posed by external validity is one of generalizability and out-of-sample prediction. For example, an internally valid estimate for a given sub-group of the population might not be informative about the treatment effect for other (potentially different and policy-relevant) sub-groups. Similarly, ATT is, in general, different from ATE. Replicability of the same results in other contexts and times is of fundamental interest for providing policy recommendations.

#### **3.4 Random Assignment Strengthens Internal Validity**

As Andrew Leigh puts it in his book "*Randomistas: How Radical Researchers Are Changing the World*," (Leigh, 2018) randomized controlled trials (RCTs) use "the power of chance" to assign the groups. Randomization can be achieved by flipping a coin, drawing the shorter straw, or using a computer to randomly assign statistical units to groups. In any of these cases, the result would be the same: the treatment and the control group are random samples from the same population.

Random assignment ensures that treatment and control units are the same in every respect, including their expected *Yi*(0). It follows that, in RCTs, selection bias must be zero since *E*(*Yi*(0)| *Di* = 1) = *E*(*Yi*(0)| *Di* = 0). In other words, what we observe for control units approximates what would have happened to treated units in the absence of treatment. It is worth noting that random assignment does not work by eliminating individual differences, but it rather ensures that the composition of units being compared is the same.

RCTs ensure a *ceteris paribus* (i.e., without confounds) comparison of treatment and control groups. Because of this, an RCT provides an internally valid research design for assessing causality. Evidence in support of this validity can be obtained using pre-intervention measurements. In fact, it is a good practice to collect this information and test the validity of the design by carrying out a battery of "balancing" tests. In a properly implemented randomization, there are no selective differences in the distribution of pre-intervention measurements between treated and control units. This statement does not rule out the possibility of between-group differences arising from sampling variability, which is a problem concerning the statistical validity (that is, the precision of point estimates) of RCTs.
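A balancing test of the kind described above can be sketched as follows. The data are simulated, the pre-intervention score is a hypothetical covariate, and the test is a plain two-sample comparison of means with unequal variances. Actual applications usually test many covariates and may need to account for clustering.

```python
import random
from math import sqrt
from statistics import mean, variance

rng = random.Random(7)

# Hypothetical pre-intervention test scores for 2,000 students.
baseline = [rng.gauss(60, 10) for _ in range(2000)]

# Coin-toss assignment to a small class (1) or a large class (0).
assigned = [rng.randint(0, 1) for _ in baseline]

treated = [x for x, d in zip(baseline, assigned) if d == 1]
control = [x for x, d in zip(baseline, assigned) if d == 0]

# Difference in pre-intervention means and its standard error.
difference = mean(treated) - mean(control)
standard_error = sqrt(variance(treated) / len(treated)
                      + variance(control) / len(control))
t_stat = difference / standard_error

print(f"difference={difference:.3f}  s.e.={standard_error:.3f}  t={t_stat:.2f}")
```

Because assignment ignores the baseline score, the difference is nonzero only through sampling variability. Across repeated randomizations, the absolute value of the t-statistic would exceed 1.96 in about 5% of cases.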

Finally, under random assignment, the naïve comparison will provide internally valid conclusions about the average treatment effect on the treated (ATT), as we have that *E*(*Yi*|*Di* = 1) − *E*(*Yi*| *Di* = 0) = *E*(*Yi*(1) − *Yi*(0)|*Di* = 1). In addition, under randomization, the groups with *Di* = 1 and *Di* = 0 are representative of the same population so that *E*(*Yi*(1) − *Yi*(0)|*Di* = 1) = *E*(*Yi*(1) − *Yi*(0)). This means that the causal conclusions hold for any unit randomly selected from the population.
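The contrast between random and non-random assignment can be illustrated with a simulation in which unit-level effects vary. Everything in the sketch is assumed for illustration: the data-generating process, the function name `naive_and_ate`, and the self-selection rule under which students with a better background are more likely to be in a small class.

```python
import random

def average(values):
    values = list(values)
    return sum(values) / len(values)

def naive_and_ate(randomized, n=20_000, seed=3):
    """Return the naive treated-control difference and the true ATE."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        background = rng.gauss(0, 1)                 # hypothetical trait
        y0 = 60 + 5 * background + rng.gauss(0, 3)   # outcome if untreated
        y1 = y0 + 3 + background                     # effect varies across units
        if randomized:
            d = 1 if rng.random() < 0.5 else 0       # lottery
        else:
            d = 1 if background + rng.gauss(0, 1) > 0 else 0  # self-selection
        rows.append((y0, y1, d))
    naive = (average(y1 for y0, y1, d in rows if d == 1)
             - average(y0 for y0, y1, d in rows if d == 0))
    ate = average(y1 - y0 for y0, y1, d in rows)
    return naive, ate

naive_lottery, ate_lottery = naive_and_ate(randomized=True)
naive_choice, ate_choice = naive_and_ate(randomized=False)
print(f"lottery: naive={naive_lottery:.2f}  ATE={ate_lottery:.2f}")
print(f"choice:  naive={naive_choice:.2f}  ATE={ate_choice:.2f}")
```

Under the lottery, the naive comparison differs from the ATE only by sampling error. Under self-selection it is far larger, because it combines an ATT that exceeds the ATE with a positive selection bias.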

Random assignment to treatment is not uncommon in numerous fields of the social sciences. One such example is the lottery-based allocation of pupils to schools that are oversubscribed. This alternative to the traditional priority criterion based on proximity should dampen school stratification caused by wealthy parents buying houses in the close vicinity of high-quality schools. As a result, among the pool of applicants to a school where oversubscription is resolved by a lottery, getting a seat or not is completely random. Some researchers (see Cullen et al., 2006, for an example) have exploited this to evaluate the educational effects of attending one's preferred school.

Another example is the Oregon Health Insurance Experiment (see Finkelstein et al., 2012). Medicaid is one of the landmark US public health insurance programs and provides care for millions of low-income families. In 2008, the state of Oregon extended coverage of Medicaid by selecting eligible individuals with a lottery. This gave researchers the unique opportunity to provide credible causal estimates of the effect of health insurance eligibility on health care utilization, medical expenditure, medical debt, health status, earnings, and employment.

Although RCTs are considered as the "gold standard" for providing internally valid estimates of causal effects, they are not without shortcomings (see the excellent surveys by Duflo et al., 2008 and Peters et al., 2018). External validity is often perceived as the main limitation and more so for small-scale experiments on very specific subpopulations. Bates and Glennerster (2017) propose a framework to discuss generalizability based on four steps: identify the theory behind the program; check if local conditions hold for that theory to apply; evaluate the strength of the evidence for the required general behavioral change; evaluate whether the implementation process can be carried out well. External validity is granted if these four conditions apply in a context different from the one where the experiment was conducted. Statistical validity as well may challenge the significance of many small-scale experiments (see Young, 2019).

RCTs have other limitations. Many RCTs are carried out as small-scale pilots that will eventually be scaled up to the entire population. Causal reasoning in this context must consider the general equilibrium effects arising from this change in scope. These effects are concerned with the possible externalities for non-participants when the policy is implemented on a larger scale and the implications for market equilibria. An additional concern about RCTs is that the sole fact of being "under evaluation" may generate some behavioral response that has nothing to do with a treatment effect.<sup>1</sup> Replicability of experiments also has been called into question in many fields of the social sciences (see Open Science Collaboration, 2015, for psychology and Camerer et al., 2016, for economics).

What happens when randomization is not a feasible option? This is the question to which we turn next.

#### **3.5 Internally Valid Reasoning Without RCTs: Instrumental Variation**

#### *3.5.1 A Tale of Pervasive Manipulation*

Randomizations obtained by design are not the only way to ensure *ceteris paribus* comparisons. Randomness in the assignment to treatment may arise indirectly from natural factors or events independently of the causal channel of interest. Under assumptions that we shall discuss, these factors can be used instrumentally to pin down a meaningful causal parameter. The most important takeaway message here is that we must use assumptions to make up for the lack of randomization. Because of this, much of the simplicity of the research design is lost, and internal validity must be addressed on a case-by-case basis. We will present an example of the toolbox for good empirical investigations using administrative data on student achievement and, further below, class size.

Our working example makes use of standardized tests from INVALSI (a government agency charged with educational assessment) for second and fifth graders in Italian schools for the years 2009–2011. Italy is an interesting case study as it is characterized by a sharp North–South divide along many dimensions, including school quality. This divide motivates public interventions to improve school inputs in the South. As testing regimes have proliferated in the country, so has the temptation to cut corners or cheat at the national exam.<sup>2</sup> As shown in Fig. 3.1, the South is distinguished by widespread manipulation on standardized tests. INVALSI tests are usually proctored and graded by teachers from the same school, and past work by Angrist et al. (2017) has shown that manipulation takes place during the grading process. Classes with manipulated scores are those where teachers did not grade exams honestly.

<sup>1</sup>Such quirky responses are called "Hawthorne" effects for treated subjects and "John Henry" effects for controls.

Consider the causal effect of manipulation on test scores. As scores are inflated, the sign of this effect is obvious. However, the size of the causal effect (that is, by how much scores are inflated) is difficult to measure because manipulation is not the result of random factors. The incentive to manipulate likely decreases as true scores increase so that the distribution of students' true scores is not the same across classes with teachers grading honestly or dishonestly. Again, this is a problem about the composition of the two groups, as treatment classes (with manipulated scores) and control classes (with honest scores) need not be representative of the same population.

<sup>2</sup>Cheating or manipulation is not unique to Italy, as discussed in Battistin (2016).

When empirical work is carried out using observational data, as is the case here, it is always illuminating to start from the thought experiment. This is the hypothetical experiment that would be used to measure the causal effect of interest if we had the possibility to randomize units. With observational data, the identification strategy consists of the assumptions that we must make to replicate the experimental ideal. The thought experiment in the case of INVALSI data corresponds to distributing manipulation (the treatment) across classes at random. The identification strategy here amounts to the set of assumptions needed to mimic the very same experimental ideal *even if* manipulation is not random. How can this be possible?

Econometrics combined with the institutional context come to the rescue. It turns out that about 20% of primary schools in Italy are randomly assigned to external monitors, who supervise test administration and the grading of exams from local teachers in selected classes within the school (see Bertoni et al., 2013, and Angrist et al., 2017, for details on the institutional context). Table 3.1 shows that monitors are indeed assigned to schools using a lottery. Schools with monitors are statistically indistinguishable from the others along several dimensions, including average class size and grade enrollment. For example, the table shows that the average class size in unmonitored classes of the country is 19.812 students. The difference between treated and control classes is as small as 0.035 students and statistically indistinguishable from zero. Additional evidence on the lack of imbalance between schools with and without monitors is in Angrist et al. (2017). In the next section, we discuss how to use the monitoring randomization to learn about the effects of manipulation on scores.
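The logic of a balance check like the one in Table 3.1 can be sketched on simulated data. The snippet below is only an illustration under stated assumptions: the data are synthetic, the function name `balance_check` and the numbers used to generate class sizes are hypothetical, and the comparison is a plain difference in means rather than the regression with strata controls used in the original table.

```python
import numpy as np

def balance_check(x, z):
    """Difference in means of covariate x between z == 1 and z == 0,
    with an unequal-variance standard error and the implied t-ratio."""
    x1, x0 = x[z == 1], x[z == 0]
    diff = x1.mean() - x0.mean()
    se = np.sqrt(x1.var(ddof=1) / x1.size + x0.var(ddof=1) / x0.size)
    return diff, se, diff / se

rng = np.random.default_rng(42)
n_classes = 20_000

# Hypothetical lottery: about 20% of classes are in monitored schools.
monitored = (rng.random(n_classes) < 0.20).astype(int)

# Synthetic class sizes, drawn independently of monitoring by construction.
class_size = rng.normal(loc=19.8, scale=3.5, size=n_classes)

diff, se, t_ratio = balance_check(class_size, monitored)
print(f"difference = {diff:.3f}, s.e. = {se:.3f}, t = {t_ratio:.2f}")
```

Under random assignment, the difference should be small relative to its standard error, which is the pattern the table documents for the actual monitoring lottery.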

#### *3.5.2 General Formulation of the Problem*

In our example, the class is the statistical unit of analysis and the treatment is manipulation (*Di* = 1 if class scores are manipulated and *Di* = 0 if they are honestly reported). INVALSI has developed a procedure to reveal *Di*, so treatment status is observed in the data. Scores (standardized by grade, year, and subject) are the class-level outcome, *Yi*. The presence of external monitors is described by a binary random variable *Zi*, with *Zi* = 1 for classes in schools with monitors and *Zi* = 0 otherwise. In the applied econometrics parlance, variables like *Zi*, which is randomly assigned and can influence treatment status, are called "instruments."

The ordinary least squares (OLS) regression of *Yi* on *Di* summarizes the correlation between manipulation and reported scores. Estimation results obtained from


**Table 3.1** Covariate balance in the monitoring experiment (Angrist et al., 2017)

Columns 1, 3, and 5 show means and standard deviations for variables listed at the left. Other columns report coefficients from regressions of each variable on a treatment dummy (indicating classroom monitoring), grade and year dummies, and sampling strata controls (grade enrollment at institution, region dummies, and their interactions). Standard deviations for the control group are in square brackets; robust standard errors are in parentheses

a p<0.01, b p<0.05, c p<0.1

OLS are reported in Table 3.2, and a positive correlation between cheating and test scores is revealed in all columns. For instance, the value of the coefficient reported in Column (1) of Panel A implies that when we consider data for the whole of Italy, the average math score in classes with manipulated scores is 1.414 standard deviations higher than in classes where teachers did not manipulate scores.<sup>3</sup> However, as discussed above, this result cannot be given any causal interpretation, as the samples with *Di* = 0 and *Di* = 1 are non-randomly selected.

Unlike *Di*, the status *Zi* is randomly assigned. So, it can be instructive to consider the regression of *Yi* on *Zi*, summarizing the correlation between reported scores and monitoring. As *Zi* is randomly assigned, the latter regression yields the causal effect of monitoring on scores (orthodox empiricists often call this regression the "reduced form equation"). Results in Columns (1)–(3) of Table 3.3 show a negative effect of monitoring on test scores in all columns (see Bertoni et al., 2013). For example, from Column (1) of Panel A, we learn that the average math score in schools with external monitors is 0.112 standard deviations lower than in schools without monitors. Arguably, the negative effect of monitoring on scores passes through a reduction of manipulation.

We need to enrich our causal inference vocabulary to consider potential outcomes based on the 2x2 scenarios that result from the cross-tabulation of *Di* and *Zi*: *Yi*(*Di*, *Zi*). Similarly, we need to adjust the notation to express the idea that *Zi*

<sup>3</sup>Here and in what follows, INVALSI scores are standardized to have zero mean and unit variance by subject and year.


**Table 3.2** Correlation between score manipulation and test scores

All models control for a quadratic polynomial in grade enrollment, segment dummies, and their interactions. The unit of observation is the class. Robust standard errors, clustered on school and grade, are shown in parentheses. Control variables include % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and the proportions of missing values in these variables. All regressions additionally include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). a p<0.01, b p<0.05, c p<0.1


**Table 3.3** Monitoring effects on test scores and score manipulation (Angrist et al., 2017)

Columns 1–3 report the reduced form effects of having a monitor at the institution on test scores. Columns 4–6 show the first-stage estimates of the effect of having a monitor at the institution on score manipulation. All models control for a quadratic polynomial in grade enrollment, segment dummies, and their interactions. The unit of observation is the class. Robust standard errors, clustered on school and grade, are shown in parentheses. Control variables include % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and proportions of missing values in these variables. All regressions additionally include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). a p<0.01, b p<0.05, c p<0.1

affects *Di*. We define potential treatments *Di*(0) and *Di*(1) as the treatment status that individual *i* has when exposed to *Zi* = 0 and *Zi* = 1, respectively. In our running example, the realized score *Yi* corresponds to the potential score realized for the observed combination {*Di* = *d*, *Zi* = *z*}, while the realized manipulation *Di* coincides with the potential manipulation realized for the observed value of *Zi* = *z*. For example, *Yi*(1, 1) represents the score that would be recorded for class *i* if teacher grading was dishonest (*Di* = 1) and the school had an INVALSI monitor (*Zi* = 1). Recall that, since only selected classes within the school are monitored, dishonest behavior from teachers in unmonitored classes within the school is always possible (see Bertoni et al., 2013).

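Depending on the values taken by *Di*(0) and *Di*(1), we can divide classes into four groups according to the behavior of teachers grading the exams (see Battistin et al., 2017, for a similar approach):

- complying dishonest teachers (*C*-teachers), with *Di*(0) = 1 and *Di*(1) = 0, who manipulate scores only in the absence of monitors;
- always dishonest teachers (*A*-teachers), with *Di*(0) = *Di*(1) = 1, who manipulate scores regardless of monitoring;
- never dishonest teachers (*N*-teachers), with *Di*(0) = *Di*(1) = 0, who grade honestly regardless of monitoring;
- non-complying dishonest teachers (*D*-teachers), with *Di*(0) = 0 and *Di*(1) = 1, who manipulate scores only in the presence of monitors.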
This classification does not hinge on any assumptions and represents the taxonomy of all possible behavioral responses from teachers arising from the monitoring status of the school. The fact that both *Di* and *Zi* are binary limits to four the number of such responses.
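The four-way taxonomy can be made concrete by enumerating every pair of potential treatments. The sketch below is illustrative only: the labels follow the type names used later in this section, while the function names `compliance_type` and `observed_treatment` are hypothetical helpers introduced here.

```python
from itertools import product

# Keys are (D_i(0), D_i(1)): manipulation without and with a monitor.
TYPE_LABELS = {
    (1, 0): "C (complying dishonest)",      # manipulates only without a monitor
    (1, 1): "A (always dishonest)",         # manipulates regardless of monitoring
    (0, 0): "N (never dishonest)",          # never manipulates
    (0, 1): "D (non-complying dishonest)",  # manipulates only with a monitor
}

def compliance_type(d0: int, d1: int) -> str:
    """Return the teacher type implied by the potential treatments."""
    return TYPE_LABELS[(d0, d1)]

def observed_treatment(d0: int, d1: int, z: int) -> int:
    """Realized manipulation D_i given the monitoring status Z_i = z."""
    return d1 if z == 1 else d0

# Two binary variables give exactly four possible behavioral responses.
for d0, d1 in product((0, 1), repeat=2):
    print(f"D(0)={d0}, D(1)={d1} -> {compliance_type(d0, d1)}")
```

Only one of the two potential treatments is ever observed for a given class, which is why the type of an individual class cannot in general be read off the data.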

#### *3.5.3 Assumptions*

The identification strategy for the analysis of natural experiments builds on four assumptions. We now discuss each of them with reference to our specific running example on the effect of manipulation on test scores. We refer the reader to Angrist and Pischke (2008) for a more general discussion.

#### **3.5.3.1 The "Monotonicity" Assumption**

We begin our investigation by assuming lack of non-complying dishonest teachers (*D*-teachers) in the data. This is a rather innocuous assumption in our context. A violation would represent a quirky behavioral response to the presence of monitors. This assumption is also known as the monotonicity condition. It is a restriction on the behavior of units stating that when we move the instrument *Zi* from *z*′ to *z*′′, all agents respond by changing their *Di* *in the same direction* or by leaving it unaltered. In our case, this assumption implies that (a) honest teachers without monitors at school would be honest teachers even with a monitor and (b) dishonest teachers without monitors at school might grade honestly under the threat of a monitor at school. In the former case, the value of *Di* is unchanged by monitoring and remains zero; in the latter case, the value of *Di* may remain one or turn to zero with monitoring. The events (a) and (b) imply that the distribution of the variable *Di* must move toward zero in the presence of school monitoring. Ruling out the presence of *D*-teachers implies that monitors cannot change the variable *Di* in the opposite direction, from zero to one. This exemplifies why the variable *Zi* must induce a monotone (towards zero) behavior for all teachers.

Monotonicity plays a crucial role in natural experiments: under this assumption, we are left with three compliance types (*C*, *A*, and *N*) whose shares in the population can be represented by *πC*, *πA*, *πN*, respectively. Manipulators are a mixture of always dishonest teachers (*A*-teachers) and complying dishonest teachers (*C*-teachers) without monitors. Honest teachers are composed of never dishonest teachers (*N*-teachers) and complying dishonest teachers (*C*-teachers) with monitors.

#### **3.5.3.2 The "As Good as Random" Assumption**

A second key relationship among the variables involved arises because schools are randomly assigned to either *Zi* = 1 or *Zi* = 0. Because of the monitoring experiment, the two groups of schools must have the same composition with respect to any variable, including potential outcomes and potential treatment statuses. It, therefore, follows that {*Yi*(1, 1), *Yi*(0, 1), *Yi*(1, 0), *Yi*(0, 0), *Di*(0), *Di*(1)} ⊥ *Zi*. In our case, this "as good as random" assumption holds by design, because monitors have been explicitly assigned at random to schools.

#### **3.5.3.3 The "Exclusion Restriction"**

The causal reasoning builds upon an exclusion restriction. This formalizes the causal construct that the effect of *Zi* on *Yi* shall be solely because of the effect of *Zi* on *Di*. In the example considered here, this restriction can be put across considering the following equations:

$$\begin{aligned} Y_i\left(0, 1\right) &= Y_i\left(0, 0\right), \\ Y_i\left(1, 1\right) &= Y_i\left(1, 0\right). \end{aligned}$$

Therefore, the exclusion restriction implies that there are only two potential outcomes, indexed against *Di*: *Yi*(*Di*). For example, the frst equation implies that scores under honest grading (*Di* = 0) would be the same irrespective of the presence of monitors. Similarly, the second equation implies that dishonest grading (*Di* = 1) would yield the same score independently of school monitoring. The latter

condition would be violated if, for example, always dishonest teachers cheated differently in the presence of external monitors at school. This possibility is discussed in Battistin et al. (2017) and is ruled out in the case of INVALSI data by results in Angrist et al. (2017).

#### **3.5.3.4 The "First-Stage" Requirement**

The assumed causal link from *Zi* to *Di* can be verified in the data by running an OLS regression of *Di* on *Zi*. In fact, it is a good practice to verify the size and statistical strength of this "first-stage" regression in any study based on quasi-experimental variation, as the causal chain we have in mind originates from the effect of *Zi* on *Di*. Should we observe any effect of *Zi* on *Yi* but no effect of *Zi* on *Di*, it would be hard to justify that the random variation in *Zi* affected *Yi* via the ability of *Zi* to move *Di*. Estimates of the "first-stage" relationship between exposure to monitors and manipulation are reported in Columns (4)–(6) of Table 3.3. As expected, score manipulation is less likely in schools where monitors are present. For example, Column (4) of Panel A indicates that the probability of score manipulation is 2.9 percentage points lower in schools of the country with monitors. This is equivalent to a 36% decrease in the probability of manipulation with respect to the mean in nonmonitored schools (equal to 6.4%). As demonstrated by the estimates in Columns (5) and (6) of Table 3.3, this decrease is stronger in Southern Italy than in the North and Center of the country and strongly statistically significant.

#### *3.5.4 Better LATE than Never*

To nail down the causal effect of manipulation on scores, we proceed by comparing the expected value of the product *YiDi* for schools with and without monitors. This product is equal to *Yi* for units with *Di* = 1 and to 0 for units with *Di* = 0. Given all the assumptions made so far, we have that:

$$\begin{aligned} E\left(Y_i D_i \mid Z_i = 1\right) &= \pi_A \, E\left(Y_i\left(1\right) \mid A\right), \\ E\left(Y_i D_i \mid Z_i = 0\right) &= \pi_C \, E\left(Y_i\left(1\right) \mid C\right) + \pi_A \, E\left(Y_i\left(1\right) \mid A\right). \end{aligned}$$

In the first equation, neither *C*-teachers nor *N*-teachers show up, because for them *Di* = 0 when *Zi* = 1 so that *YiDi* = 0.<sup>4</sup> Because of the monotonicity assumption, there

<sup>4</sup>A consequence of random assignment of *Zi* and of the exclusion restriction is that conditional on the compliance types defined above, potential outcomes are independent of *Zi*, that is, {*Yi*(1), *Yi*(0)} ⊥ *Zi* ∣ {*Di*(0), *Di*(1)}. In fact, conditional on a given compliance type, there is a one-to-one mapping between *Zi* and *Di*, and therefore, knowledge of *Zi* implies knowledge of *Di*.

are no *D*-type teachers either. Therefore, the only group left is that of *A*-teachers, whose fraction in the population is *πA* and for whom we always observe *Yi*(1). In a similar fashion, we do not see *N*-teachers in the second line, as for them, *Di* = 0 when *Zi* = 0. Consequently, after ruling out the presence of *D*-teachers by monotonicity, only *C*- and *A*-teachers show up in this equation. *C*-teachers account for a fraction *πC* of the population, and for them, we observe *Yi*(1) as in this case *Zi* = 0, and therefore, *Di* = 1.

For these very same reasons, if we compare the share of manipulators in schools with and without external monitors, we obtain:

$$\begin{aligned} E\left(D_i \mid Z_i = 1\right) &= \pi_A, \\ E\left(D_i \mid Z_i = 0\right) &= \pi_C + \pi_A. \end{aligned}$$

The former expression states that only *A*-teachers have *Di* = 1 when *Zi* = 1; the latter that both *C*- and *A*-teachers have *Di* = 1 when *Zi* = 0. Analogous expressions can be derived for *E*(*Yi*(0)| *C*), *E*(*Yi*(0)| *N*) and for *πN* if one substitutes *Di* with (1 − *Di*) in the above. We have that:

$$\begin{aligned} E\left(Y_i\left(1-D_i\right) \mid Z_i=1\right) &= \pi_C \, E\left(Y_i\left(0\right) \mid C\right) + \pi_N \, E\left(Y_i\left(0\right) \mid N\right), \\ E\left(Y_i\left(1-D_i\right) \mid Z_i=0\right) &= \pi_N \, E\left(Y_i\left(0\right) \mid N\right), \\ E\left(\left(1-D_i\right) \mid Z_i=1\right) &= \pi_C + \pi_N, \\ E\left(\left(1-D_i\right) \mid Z_i=0\right) &= \pi_N. \end{aligned}$$

In the first and third equations, *A*-teachers do not show up because they always have *Di* = 1 so that *Yi*(1 − *Di*) = 0 and (1 − *Di*) = 0.<sup>5</sup> Because of the monotonicity assumption, there are no *D*-type teachers either. Therefore, only *C*- and *N*-teachers are left. *C*-teachers account for a fraction *πC* of the population. Since in this case *Zi* = 1, for them, we observe *Di* = 0 and, therefore, *Yi*(0). *N*-teachers are a share *πN* of the population; for them, *Di* is always equal to 0, so we observe *Yi*(0).

Similarly, in the second and fourth lines, we do not see *A*- and *C*-teachers, as for them *Di* = 1 when *Zi* = 0. Consequently, after ruling out the presence of *D*-teachers by monotonicity, only *N*-teachers are left.

<sup>5</sup>A consequence of random assignment of *Zi* and of the exclusion restriction is that conditional on the compliance types defined above, potential outcomes are independent of *Zi*, that is, {*Yi*(1), *Yi*(0)} ⊥ *Zi* ∣ {*Di*(0), *Di*(1)}. In fact, conditional on a given compliance type, there is a one-to-one mapping between *Zi* and *Di*, and therefore, knowledge of *Zi* implies knowledge of *Di*.

By rearranging the equations above, it is easy to obtain:

$$E\left(Y_i\left(1\right)\mid C\right) = \frac{E\left(Y_iD_i\mid Z_i=1\right) - E\left(Y_iD_i\mid Z_i=0\right)}{E\left(D_i\mid Z_i=1\right) - E\left(D_i\mid Z_i=0\right)},\tag{3.2}$$

and

$$E\left(Y_i\left(0\right)\mid C\right) = \frac{E\left(Y_i\left(1-D_i\right)\mid Z_i=1\right) - E\left(Y_i\left(1-D_i\right)\mid Z_i=0\right)}{E\left(\left(1-D_i\right)\mid Z_i=1\right) - E\left(\left(1-D_i\right)\mid Z_i=0\right)}.\tag{3.3}$$

The difference between the last two expressions yields:

$$E\left(Y_i\left(1\right) - Y_i\left(0\right) \mid C\right) = \frac{E\left(Y_i \mid Z_i = 1\right) - E\left(Y_i \mid Z_i = 0\right)}{E\left(D_i \mid Z_i = 1\right) - E\left(D_i \mid Z_i = 0\right)},\tag{3.4}$$

which represents the average causal effect of manipulation for classes with teachers who graded honestly because of school monitoring (that is, classes with *C*-teachers). Intuitively, this happens because, in the absence of *D*-teachers, this is the only group of teachers for whom the presence/absence of monitors generates variation in manipulation. Borrowing the definition by Angrist and Imbens (1994), the parameter on the left-hand side of (3.4) is the *local average treatment effect* (LATE). The word "local" reflects the fact that causal conclusions are licensed only for a subset of classes in the population.
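The identification argument can be checked on simulated data in which compliance types and both potential outcomes are known. The following is a minimal sketch under stated assumptions: the type shares, the mean scores, and the effect sizes are hypothetical values chosen for illustration, not estimates from the INVALSI data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical type shares: 0 = C (complier), 1 = A (always), 2 = N (never).
kind = rng.choice(3, size=n, p=[0.10, 0.05, 0.85])
z = rng.integers(0, 2, size=n)          # randomly assigned monitoring

# Potential treatments under monotonicity (no D-teachers).
d0 = (kind != 2).astype(float)          # without a monitor, C and A manipulate
d1 = (kind == 1).astype(float)          # with a monitor, only A manipulates
d = np.where(z == 1, d1, d0)

# Potential outcomes depend on d only (exclusion restriction).
# Manipulating classes have lower honest scores by assumption.
y0 = np.array([-1.0, -0.5, 0.0])[kind] + rng.normal(size=n)
y1 = y0 + np.array([3.0, 2.0, 1.0])[kind]   # type-specific effects
y = np.where(d == 1, y1, y0)

def diff_by_z(v):
    """Mean of v with monitors minus mean of v without monitors."""
    return v[z == 1].mean() - v[z == 0].mean()

ey1_c = diff_by_z(y * d) / diff_by_z(d)              # Eq. 3.2
ey0_c = diff_by_z(y * (1 - d)) / diff_by_z(1 - d)    # Eq. 3.3
late = diff_by_z(y) / diff_by_z(d)                   # Eq. 3.4
naive = y[d == 1].mean() - y[d == 0].mean()          # comparison by treatment status

print(f"E[Y(1)|C] = {ey1_c:.2f}, E[Y(0)|C] = {ey0_c:.2f}")
print(f"LATE = {late:.2f}, naive difference = {naive:.2f}")
```

In this setup the complier effect is 3 by construction, and the ratio in Eq. 3.4 should recover it up to sampling noise, whereas the comparison by treatment status mixes the effects for *A*- and *C*-teachers with differences in honest scores across groups.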

Importantly, the expression on the right-hand side of Eq. 3.4 involves only the variables observed so that the causal parameter can be identified from the data. Standard econometric results imply that LATE is estimated by the coefficient on *Di* in a two-stage least squares (TSLS) regression of *Yi* on *Di*, using *Zi* to instrument for *Di*.<sup>6</sup> Table 3.4 reports the estimates of the LATE parameter in our running example and reveals that manipulation causally increased scores of students assigned to complying dishonest teachers. For example, Column (1) of Panel A tells us that score manipulation increases math results in classes with *C*-teachers by 3.827 standard deviations. This causal effect is much larger than the naïve comparison of scores by treatment status reported in Column (1) of Panel A in Table 3.2. Why is this the case? As illustrated in Sect. 3.2.3, the naïve comparison is equal to a causal effect plus selection bias. In this case, selection bias corresponds to the difference in average score of manipulators and non-manipulators if manipulation was not possible at all. As we have argued, manipulation is less likely to occur in classes with higher average true scores. So, selection bias is likely to be negative, that is, *E*(*Yi*(0)| *Di* = 1) < *E*(*Yi*(0)| *Di* = 0).
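The equivalence between TSLS and the ratio in Eq. 3.4 can be illustrated with a hand-rolled two-stage procedure. This is a sketch under stated assumptions: the data are simulated, there are no covariates (unlike the regressions behind Table 3.4), and the helper names `ols`, `tsls_binary_instrument`, and `wald` are introduced here for illustration.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls_binary_instrument(y, d, z):
    """Slope on d from a manual two-stage least squares with instrument z."""
    const = np.ones_like(y, dtype=float)
    Z = np.column_stack([const, z])
    d_hat = Z @ ols(Z, d)                      # first stage: fitted treatment
    X_hat = np.column_stack([const, d_hat])
    return ols(X_hat, y)[1]                    # second stage: slope on fitted d

def wald(y, d, z):
    """Reduced-form difference divided by first-stage difference (Eq. 3.4)."""
    reduced_form = y[z == 1].mean() - y[z == 0].mean()
    first_stage = d[z == 1].mean() - d[z == 0].mean()
    return reduced_form / first_stage

rng = np.random.default_rng(1)
n = 50_000
z = rng.integers(0, 2, size=n)                 # random monitoring
u = rng.normal(size=n)                         # unobserved true-score component
# Hypothetical design: manipulation is less likely with a monitor
# and in classes with high true scores.
p = 0.25 - 0.15 * z - 0.05 * (u > 0)
d = (rng.random(n) < p).astype(float)
y = u + 2.0 * d + rng.normal(scale=0.5, size=n)

print(f"TSLS = {tsls_binary_instrument(y, d, z):.4f}, Wald = {wald(y, d, z):.4f}")
```

With a single binary instrument and no covariates, the two numbers coincide up to floating-point error. As a rough check on the reported estimates, the ratio of the reduced-form coefficient (−0.112) to the first-stage coefficient (−0.029) is about 3.86, close to the 3.827 in Table 3.4; a gap of that size is consistent with rounding of the published coefficients.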

<sup>6</sup>A similar result applies to the expressions in (3.2) and (3.3) when TSLS regressions of *Y*i*D*i on *D*<sup>i</sup> and of *Y*i(1 − *D*i) on (1 − *D*i), respectively, are considered.


**Table 3.4** Local average treatment effect of score manipulation on test scores

All models control for a quadratic in grade enrollment, segment dummies, and their interactions. The unit of observation is the class. Robust standard errors, clustered on school and grade, are shown in parentheses. Control variables include % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and proportions of missing values in these variables. All regressions include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). a p<0.01, b p<0.05, c p<0.1

#### *3.5.5 External Validity of Causal Conclusions*

Causal conclusions can be drawn for classes with exams graded by *C*-teachers, and TSLS yields internally valid estimates of *E*(*Yi*(1) − *Yi*(0)| *C*). However, we have that *E*(*Yi*(1) − *Yi*(0)| *C*) ≠ *E*(*Yi*(1) − *Yi*(0)) in general. It follows that the ability to extend causal conclusions to all classes (that is, the external validity of *E*(*Yi*(1) − *Yi*(0)| *C*)) is precluded in general. Using the expressions derived in the previous section, we can write:

$$
\pi_C = E\left(D_i \mid Z_i = 0\right) - E\left(D_i \mid Z_i = 1\right),
\tag{3.5}
$$

so that the data is informative about the size of the population for whom this design can provide evidence about a causal effect. This is already a starting point to understand the extent of the external validity problem of causal estimates obtained by LATE. In the case of INVALSI data, the value of *πC* is equal to 2.9% for math and 2.5% for language. This can be seen from Column (4) of Table 3.3, which reports the coefficient of *Zi* in the first-stage regression of *Di* on *Zi* using data for all classes in the country. This is equal to the opposite of *πC*.<sup>7</sup> In the South, the share of *C*-teachers grows to 6.2% for math and 4.7% for language, as can be seen from Column (6) of the same table.

In our example, the size of the compliant subpopulation is relatively small. How could one extend the conclusions drawn for a possibly small share of complying dishonest teachers to the remaining classes in the population? We follow Angrist (2004) and note that the data provide information about *E*(*Yi*(1)| *A*) and *E*(*Yi*(0)| N) as well. These quantities can be obtained using expressions like those we derived above (see Battistin et al., 2017, for details). For example, we have that:

$$\begin{aligned} E\left(Y_i\left(1\right) \mid A\right) &= E\left(Y_i \mid D_i = 1, Z_i = 1\right), \\ E\left(Y_i\left(0\right) \mid N\right) &= E\left(Y_i \mid D_i = 0, Z_i = 0\right). \end{aligned}$$

The first equality holds because, in the absence of *D*-teachers, only *A*-teachers manipulate scores in the presence of monitors. Similarly, only *N*-teachers report honestly without monitors.

If potential outcomes were homogeneous across types in the population, then we would have that *E*(*Yi*(1)| *A*) = *E*(*Yi*(1)| *C*) and *E*(*Yi*(0)| *N*) = *E*(*Yi*(0)| *C*). If these two equalities cannot be rejected from the data, we would feel more confident about extending the results obtained for classes with complying dishonest teachers to other classes in the population.<sup>8</sup>
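The quantities entering this comparison can all be computed from the observed triple (*Yi*, *Di*, *Zi*). The sketch below shows one way to do so on simulated data; the function name `type_summaries`, the type shares, and the mean scores are hypothetical and chosen so that *E*(*Yi*(1)) is the same for *A*- and *C*-teachers while *E*(*Yi*(0)) differs between *C*- and *N*-teachers.

```python
import numpy as np

def type_summaries(y, d, z):
    """Type shares and mean potential outcomes recovered from (y, d, z)."""
    def diff(v):
        return v[z == 1].mean() - v[z == 0].mean()
    return {
        "pi_C": d[z == 0].mean() - d[z == 1].mean(),   # Eq. 3.5
        "pi_A": d[z == 1].mean(),                      # E(D | Z = 1)
        "pi_N": 1.0 - d[z == 0].mean(),                # E(1 - D | Z = 0)
        "EY1_C": diff(y * d) / diff(d),                # Eq. 3.2
        "EY0_C": diff(y * (1 - d)) / diff(1 - d),      # Eq. 3.3
        "EY1_A": y[(d == 1) & (z == 1)].mean(),        # manipulators with monitors
        "EY0_N": y[(d == 0) & (z == 0)].mean(),        # honest graders without monitors
    }

rng = np.random.default_rng(7)
n = 400_000
kind = rng.choice(3, size=n, p=[0.10, 0.10, 0.80])     # 0 = C, 1 = A, 2 = N
z = rng.integers(0, 2, size=n)
# With a monitor only A manipulates; without one, C and A do.
d = np.where(z == 1, kind == 1, kind != 2).astype(float)

# Hypothetical mean potential outcomes by type (C, A, N).
y0 = np.array([-1.5, -1.0, -0.5])[kind] + rng.normal(scale=0.5, size=n)
y1 = np.array([1.3, 1.3, 1.3])[kind] + rng.normal(scale=0.5, size=n)
y = np.where(d == 1, y1, y0)

s = type_summaries(y, d, z)
for name, value in s.items():
    print(f"{name}: {value:.3f}")
```

In applied work the comparison would be accompanied by standard errors, so that equality of the means across types can be tested rather than eyeballed.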

In Table 3.5, we report the comparison of *E*(*Yi*(1)| *C*) vis-à-vis *E*(*Yi*(1)| *A*) and *E*(*Yi*(0)| *C*) vis-à-vis *E*(*Yi*(0)| *N*) for Southern Italy, where the problem of manipulation is more pervasive. While the data does not reject that *E*(*Yi*(1)| *C*) is equal to *E*(*Yi*(1)| *A*), the empirical evidence suggests that *E*(*Yi*(0)| *C*) is much smaller than *E*(*Yi*(0)| *N*). For instance, as reported in Panel A of Table 3.5, for math, we have that *E*(*Yi*(1)| *C*) and *E*(*Yi*(1)| *A*) are very similar and, respectively, equal to 1.426 and 1.236 standard deviations. On the other hand, while *E*(*Yi*(0)| *C*) is equal to −1.662 standard deviations, *E*(*Yi*(0)| *N*) is much higher and equal to −0.655 standard deviations. Therefore, in this case, the data advise against the generalization of the LATE of manipulation on scores outside of the population of complying dishonest teachers.

<sup>7</sup>The number reported in the table is the estimate of *π*C with its sign flipped. This is because the share of *C*-teachers, *π*C, is given by (3.5). The coefficient on *Z*i in the regression of *D*i on *Z*i identifies instead *E*(*D*i| *Z*i = 1) − *E*(*D*i| *Z*i = 0), that is, the opposite of *π*C.

<sup>8</sup>Needless to say, full homogeneity of potential outcomes across types requires also that *E*(*Y*i(1)| *N*) = *E*(*Y*i(1)| *C*) and *E*(*Y*i(0)| *A*) = *E*(*Y*i(0)| *C*). However, the data will never reveal *E*(*Y*i(1)| N) and *E*(*Y*i(0)| *A*), as we never get to observe *D*i = 1 for N-teachers and *D*i = 0 for A-teachers. Hence, the latter two conditions cannot be tested empirically.


**Table 3.5** Average potential outcomes by type: South of Italy

*E*(*Yi*(1)| *C*) and *E*(*Yi*(0)| *C*) are obtained from 2SLS regressions as detailed in the text. *E*(*Yi*(1)| *A*) and *E*(*Yi*(0)| *N*) are computed from OLS regressions that estimate *E*(*Yi*| *Di* = 1, *Zi* = 1 ) and *E*(*Yi*| *Di* = 0, *Zi* = 0), respectively. All models control for a quadratic in grade enrollment, segment dummies, and their interactions. The unit of observation is the class. Robust standard errors, clustered on school and grade, are shown in parentheses. Control variables include % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and proportions of missing values in these variables. All regressions include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). a p<0.01, b p<0.05, c p<0.1

#### **3.6 Causal Reasoning with Administrative Rules: The Case of Regression Discontinuity Designs**

#### *3.6.1 Larger Classes, Worse Outcomes?*

The benefits of reducing student–teacher ratios on learning, educational achievement, and eventually long-term labor market outcomes have been of long-standing concern to parents, teachers, and policy-makers. Observational studies often show a negative relationship between class size and student achievement. Yet the conclusions of such studies might be subject to the problem of self-sorting of students into smaller classes.

In many countries, class size formation depends on grade enrollment using a deterministic rule, and Italy is no exception. As discussed in Angrist et al. (2017), until 2008, class size in primary schools in Italy had to be between 10 and 25. A reform in 2009 modified these limits to 15 and 27, respectively. Class formation is regulated by law, and grade enrollment above multiples of the cap on class size leads to the formation of a new class. To see this, consider the cap at 25 students in place until 2008. Schools enrolling up to 25 students must form one class. One additional student enrolled after 25 would force principals to form one additional

class, with an average class size of 13 students. The same idea extends to any multiple of 25 students. For example, crossing the 50-student limit is enough to form three classes instead of two and so forth. Because of the regulation in place, class size decreases sharply when enrollment moves from just below to just above multiples of 25. Angrist and Lavy (1999) called this relationship "Maimonides' rule" after the medieval scholar and sage Moses Maimonides who commented on a similar rule in the Talmud.<sup>9</sup> Exceptions to the rule in Italy are allowed in some cases. For example, a 10% deviation from the maximum (3 students) in either direction is possible at the discretion of school principals and upon the approval from the Ministry of Education. The presence of students with disabilities or special education needs is often invoked to justify non-compliance with the law. Moreover, principals can form classes smaller than 10 students in the most remote areas of the country.

By allowing actual class size to deviate from the class size mandated by law, these exceptions generate fuzziness in the relationship between actual and predicted class size. This can be seen in Fig. 3.2, where we report the average class size in the country by grade enrollment at school for second graders before 2008.<sup>10</sup> The sawtooth-shaped solid line reports predicted class size as a function of enrollment, the Maimonides' rule, while the dots report average actual class size by enrollment. The law predicts class size to be a non-linear and discontinuous function of enrollment. Actual class size follows predicted class size closely and more so for schools enrolling fewer than 75 students (which is the majority of schools in the country). In addition, discontinuities in the actual class size/enrollment relationship show up at multiples of 25 enrolled students. Given the soft nature of the rule, however, they are weaker than the sharp ones observed for predicted class size.
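The sawtooth pattern of predicted class size can be reproduced with a few lines of code. The sketch below assumes the usual form of Maimonides' rule, in which the number of classes is int((r − 1)/c) + 1 for enrollment r and cap c; the function name `predicted_class_size` is introduced here for illustration, and the default cap of 25 corresponds to the pre-reform regime described in the text.

```python
def predicted_class_size(enrollment: int, cap: int = 25) -> float:
    """Average class size implied by the rule for a given grade enrollment."""
    if enrollment < 1:
        raise ValueError("enrollment must be a positive integer")
    # A new class is formed each time enrollment exceeds a multiple of the cap.
    n_classes = (enrollment - 1) // cap + 1
    return enrollment / n_classes

# Predicted size drops sharply just above each multiple of the cap.
for r in (25, 26, 50, 51, 75, 76):
    print(r, round(predicted_class_size(r), 2))
```

Enrolling the 26th student splits one class of 25 into two classes of 13 on average, and the 51st student turns two classes of 25 into three classes of 17, which matches the examples given in the text.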

#### *3.6.2 Visual Interpretation*

Figure 3.3 offers a visual representation of the size of these discontinuities and is constructed using classes at schools with enrollment that falls in a [−12,12] window around the first four cutoffs shown in Fig. 3.2. Enrollment values in each window are centered to be zero at the relevant cutoff. The y-axis shows average class size conditional on the centered enrollment value shown on the x-axis. The figure also plots fitted values generated by *locally linear regression* (LLR) fits to class-level

<sup>9</sup>More precisely, let *f*igkt be the predicted class size of class *i* in grade *g* at school *k* in year *t*. We have that

$$f_{igkt} = \frac{r_{gkt}}{\operatorname{int}\left[\left(r_{gkt} - 1\right)/c_{gt}\right] + 1},$$

where *r*gkt is beginning-of-the-year grade enrollment at school *k*, *c*gt is the relevant cap (25 or 27) for grade *g*, and *int*(*x*) is the largest integer smaller than or equal to *x*.

<sup>10</sup>Similar patterns hold also for the period after the 2008 reform and for fifth graders, as shown by Angrist et al. (2017).


**Fig. 3.2** Class size by enrollment among second-grade students in pre-reform years (Angrist et al., 2017). (It shows actual class size and class size as predicted by the Maimonides' rule in pre-reform years for second-grade students)

data, as described in Angrist et al. (2017). This representation is convenient in that one can think that small classes are those in schools with grade enrollment to the right of zero. The figure shows a clear drop at this value. Class size is minimized at about 3–4 students to the right of this value, as we would expect were Maimonides' rule to be tightly enforced.

How can we use these discontinuities in class size to assess the causal effect of class size? School enrollment may be positively correlated with test scores, for example, because larger schools are typically in urban areas, and this relationship need not be linear. However, we would be tempted to infer a causal effect of class size on test scores if we observed a discontinuous change in test scores at the *exact* values of enrollment that are multiples of the maximum class size caps, where class size also discontinuously changes. This is the idea underlying the evaluation design that goes by the name of regression discontinuity (RD).

Figure 3.4 exemplifies this idea. It reports the change in average test scores as normalized enrollment moves from below to above the recentered enrollment cutoffs, separately for North and Central Italy and for the South. There is evidence of a positive discontinuity in scores as we move from below to above the cutoff in Southern Italy. Evidence of jumps for the rest of the country is instead much more limited, suggesting the possibility of causal effects of class size on learning mostly for schools in the South.

**Fig. 3.3** Class size by enrollment among second-grade students, centered at the RD cutoffs (Angrist et al., 2017). (Graphs plot residuals from a regression of class size on the following controls: % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and dummies for missing values in these variables. All regressions include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). The solid line shows a one-sided LLR fit.)

The idea underlying the RD design is that the comparison of scores of classes just above and just below the enrollment cutoffs identified by the Maimonides' rule is informative of effects of class size. Still, not all classes above the cutoffs are small and not all classes below are large, because of discretion in the application of the rule. Intuitively, if compliance with the rule were perfect, then the graphical analysis would already reveal the causal effect. If compliance is not perfect, we may want to use the rule as an instrument for class size formation. Intuitively, the crucial assumption here is that the Maimonides' rule must affect performance at school only because it affects class size formation. A juxtaposition with the identification results discussed in Sect. 3.5 reveals that, in this case, the causal effect of class size on learning is identified only for schools that would form smaller classes because of compliance with the rule. We will come back to this point later in this section.
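The idea of using the rule as an instrument near the cutoff can be sketched as a ratio of two jumps: the jump in scores divided by the jump in the probability of a small class. The example below is a simplified illustration on simulated data, not the specification used by Angrist et al. (2017): the compliance rates, the effect size `tau`, the linear trend in enrollment, and the helper `jump_at_cutoff` are all hypothetical, and plain linear fits on each side stand in for locally linear regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

r = rng.integers(-12, 13, size=n)             # centered enrollment in [-12, 12]
above = (r >= 0).astype(float)                # rule predicts a small class

# Imperfect compliance: the rule shifts the probability of a small class.
p_small = np.where(above == 1, 0.85, 0.10)
small = (rng.random(n) < p_small).astype(float)

tau = 0.20                                    # hypothetical effect of a small class
y = 0.01 * r + tau * small + rng.normal(size=n)

def jump_at_cutoff(v, r):
    """Right-side minus left-side linear fit, both evaluated at r = 0."""
    right = np.polyfit(r[r >= 0], v[r >= 0], 1)
    left = np.polyfit(r[r < 0], v[r < 0], 1)
    return np.polyval(right, 0.0) - np.polyval(left, 0.0)

jump_scores = jump_at_cutoff(y, r)            # analogue of the reduced form
jump_small = jump_at_cutoff(small, r)         # analogue of the first stage
rd_estimate = jump_scores / jump_small

print(f"jump in scores = {jump_scores:.3f}, jump in P(small) = {jump_small:.3f}")
print(f"fuzzy RD estimate = {rd_estimate:.3f}")
```

The ratio mirrors Eq. 3.4, with the indicator for being above the cutoff playing the role of the instrument, so the estimate speaks only to schools whose class size responds to the rule.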

**Fig. 3.4** Test scores by enrollment among second-grade students, centered at the RD cutoffs (Angrist et al., 2017). (Graphs plot residuals from a regression of test scores on the following controls: % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and proportions of missing values in these variables. All regressions additionally include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). The solid line shows a one-sided LLR fit.)

#### *3.6.3 General Formulation of the Problem*

Following our running example, the class is the statistical unit of analysis and the treatment is class size.<sup>11</sup> To ease the narrative, we distinguish between small and large classes and move to the background the possibility of a "continuous" treatment (number of students in class). Small classes will have *Di* = 1 and large classes *Di* = 0. In our narrative, the Maimonides' rule predicts small classes to the right of the recentered cutoffs in Fig. 3.2. Similarly, a large class is predicted for grade enrollment at or below the cutoffs in the same figure. Potential outcomes *Yi*(1) and *Yi*(0) are the average test scores that class *i* would get if it were small or large, respectively. Grade enrollment at the school of class *i* is *ri*. Without loss of generality and consistent with Fig. 3.3, we recentered grade enrollment at zero using a [−12,12] window around cutoffs.

#### **3.6.3.1 The Sharp RD Design**

We start our discussion by assuming full compliance of school principals with the Maimonides' rule. In other words, we pretend that all classes with *ri* at or above zero are small and that all classes with *ri* below zero are large. This is equivalent to assuming a deterministic relationship between *ri* and class size, which we express using the following notation: *Di* = 1(*ri* ≥ 0). We use this *sharp* setting to write the comparison of outcomes for classes in schools with grade enrollment in a neighborhood of the Maimonides' cutoff. The notion of cutoff proximity will be exemplified by using limits from below and above zero. Accordingly, the notation *ri* = 0<sup>+</sup> in what follows should read "just above the Maimonides' cutoff"; the notation *ri* = 0<sup>−</sup> is instead "just below the Maimonides' cutoff."

<sup>11</sup>We will drop all indexes other than *i* in what follows. The data contains additional dimensions, but we ignore them for expositional simplicity. One dimension is grade and year. However, scores are standardized by grade and year, so we can ignore them. As a result of this normalization, we end up having repeated measurements over time for classes at the same school. Another dimension is the reform regime. We recenter enrollment to the right cutoff depending on the regulation in place, and we, therefore, abstract from this dimension.

We have that:

$$\begin{aligned} \lim\_{r \to 0^-} E\left(Y\_i \mid r\_i = r\right) &= \lim\_{r \to 0^-} E\left(Y\_i \left(0\right) \mid r\_i = r\right) + \lim\_{r \to 0^-} E\left(D\_i\left(Y\_i \left(1\right) - Y\_i \left(0\right)\right) \mid r\_i = r\right) \\ &= \lim\_{r \to 0^-} E\left(Y\_i \left(0\right) \mid r\_i = r\right), \end{aligned}$$

because in classes to the left of the Maimonides' cutoff *Di* is zero so that the second term vanishes. For classes with *ri* above zero, we have:

$$\begin{aligned} \lim\_{r \to 0^{+}} E\left(Y\_{i} \mid r\_{i} = r\right) &= \lim\_{r \to 0^{+}} E\left(Y\_{i} \left(0\right) \mid r\_{i} = r\right) + \lim\_{r \to 0^{+}} E\left(D\_{i}\left(Y\_{i}\left(1\right) - Y\_{i}\left(0\right)\right) \mid r\_{i} = r\right) \\ &= \lim\_{r \to 0^{+}} E\left(Y\_{i}\left(0\right) \mid r\_{i} = r\right) + \lim\_{r \to 0^{+}} E\left(Y\_{i}\left(1\right) - Y\_{i}\left(0\right) \mid r\_{i} = r\right), \end{aligned}$$

because *Di* is one deterministically. It follows that the outcome difference between small and large classes at the cutoff can be written as:

$$\begin{aligned} \lim\_{r \to 0^+} E\left(Y\_i \mid r\_i = r\right) - \lim\_{r \to 0^-} E\left(Y\_i \mid r\_i = r\right) &= \lim\_{r \to 0^+} E\left(Y\_i \left(1\right) - Y\_i \left(0\right) \mid r\_i = r \right) \\ &\quad + \lim\_{r \to 0^+} E\left(Y\_i \left(0\right) \mid r\_i = r\right) - \lim\_{r \to 0^-} E\left(Y\_i \left(0\right) \mid r\_i = r\right). \end{aligned}$$

The parallel with the naïve comparison discussed in Eq. 3.1 is striking: the comparison of outcomes for small (*ri* = 0<sup>+</sup>) and large (*ri* = 0<sup>−</sup>) classes is equal to a causal effect for units just to the right of *ri* = 0:

$$\lim\_{r \to 0^{+}} E(Y\_i(1) - Y\_i(0) \mid r\_i = r),$$

plus a selection bias term:

$$\lim\_{r \to 0^{+}} E\left(Y\_{i}\left(\mathbf{0}\right)|r\_{i} = r\right) - \lim\_{r \to 0^{-}} E(Y\_{i}\left(\mathbf{0}\right)|r\_{i} = r),$$

measuring differences in a local neighborhood of *ri* = 0 that would have occurred even without treatment (i.e., if class size could be only large). What conditions are needed to ensure that the latter term is zero? A closer look at the two terms in the last expression reveals an idea of *continuity*. The condition:


$$\lim\_{r \to 0^{+}} E\left(Y\_i(\mathbf{0}) \mid r\_i = r\right) = \lim\_{r \to 0^{-}} E(Y\_i(\mathbf{0}) \mid r\_i = r),\tag{3.6}$$

is sufficient to eliminate selection bias and is equivalent to assuming that the relationship between the outcome *Yi*(0) and grade enrollment is continuous at *ri* = 0. This is a mild regularity condition, which most likely holds in most applications, and has a very simple interpretation: our hopes to give any causal interpretation to discontinuities in school performance observed around Maimonides' cutoffs must rest on the assumption that there would have been no discontinuity in performance crossing from *ri* = 0<sup>−</sup> over to *ri* = 0<sup>+</sup> had the Maimonides' rule been irrelevant for forming a small class. Assumption (3.6) combined with its counterpart for the *Yi*(1) outcome:

$$\lim\_{r \to 0^{+}} E\left(Y\_{i}\left(1\right) \mid r\_{i} = r\right) = \lim\_{r \to 0^{-}} E\left(Y\_{i}\left(1\right) \mid r\_{i} = r\right),\tag{3.7}$$

ensures:

$$\lim\_{r \to 0^{+}} E\left(Y\_{i} \mid r\_{i} = r\right) - \lim\_{r \to 0^{-}} E\left(Y\_{i} \mid r\_{i} = r\right) = E\left(Y\_{i}\left(1\right) - Y\_{i}\left(0\right) \mid r\_{i} = 0\right). \tag{3.8}$$

Assumption (3.7) brings to the problem the same regularity condition in (3.6), with a similar interpretation.

The notion of continuity of potential outcomes around Maimonides' cutoffs is evocative of the properties of a full randomization of students to small and large classes in schools with grade enrollment near *ri* = 0. For example, assumption (3.6) can be interpreted as an independence condition between *Yi*(0) and *Di locally* with respect to the Maimonides' cutoff. This is the same sort of condition that we discussed in Sect. 3.4 above. It follows that the internal validity of RD estimates obtained from (3.8) hinges upon the assumption that students in schools with values of *ri* near zero are as good as randomly assigned to small and large classes, as in a local randomized experiment. In Sect. 3.6.4 below, we discuss how potential violations of such condition may arise in practice and propose some tests to assess the plausibility of this assumption.

Compared to a standard randomized experiment, we pay a price in terms of external validity, as RD estimates are internally valid only around Maimonides' cutoffs. The extrapolation of this effect away from the cutoff requires further assumptions about the global shape of the potential outcome functions, which must be discussed on a case-by-case basis. We refer the interested reader to the work by Battistin and Rettore (2008), Angrist and Rokkanen (2015), Dong and Lewbel (2015), and Bertanha and Imbens (2020).

RD estimates of causal effects are obtained from the sample analogue of the expression in (3.8).<sup>12</sup> The simplest way to proceed is by comparing the mean sample outcomes for small and large classes within a fixed distance from the Maimonides' cutoff *ri* = 0. The simplicity of this estimator is very appealing, but we may encounter statistical validity issues if the data are "sparse" around the Maimonides' cutoff. In fact, we face a trade-off. On the one hand, to enhance statistical validity, we would be tempted to enlarge the width of the neighborhood around the Maimonides' cutoff considered for estimation. On the other hand, by so doing, we would end up also using data points far away from the cutoff. If the relationship between *Yi* and *ri* is not flat, this could endanger the internal validity of the design.

<sup>12</sup>Lee and Lemieux (2010) provide a thorough discussion of estimation issues in RD designs. We refer the interested reader to their survey for additional details.
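The trade-off can be made concrete with a small simulation. The sketch below is purely illustrative: the data are simulated rather than taken from the Italian sample, and the function names, the slope of 0.1, and the effect of 0.5 are assumptions chosen for the example. It compares mean outcomes within a window of half-width `h` on either side of the cutoff in a sharp design.

```python
import random
import statistics

def simulate_sharp_rd(n=20000, effect=0.5, slope=0.1, seed=0):
    """Simulated classes: r is recentered enrollment, D = 1(r >= 0) is the sharp rule."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        r = rng.uniform(-12, 12)
        d = 1 if r >= 0 else 0
        y0 = slope * r + rng.gauss(0, 1)  # Y(0) is continuous in r, as in (3.6)
        rows.append((r, y0 + effect * d))
    return rows

def window_mean_difference(rows, h):
    """Mean outcome for 0 <= r < h minus mean outcome for -h <= r < 0."""
    above = [y for r, y in rows if 0 <= r < h]
    below = [y for r, y in rows if -h <= r < 0]
    return statistics.fmean(above) - statistics.fmean(below)

rows = simulate_sharp_rd()
for h in (1, 3, 6, 12):
    print(f"h = {h:>2}: estimate = {window_mean_difference(rows, h):.3f}")
```

In this simulated design, the window estimator is centered on 0.5 + 0.1 × `h` rather than on 0.5: narrow windows stay close to the true effect but use few observations, whereas wide windows are precise but pick up the slope of the outcome in enrollment.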

To mitigate this trade-off, researchers often rely on semi-parametric estimators. Kernel-weighted local regressions of the outcome on a low-order (linear or quadratic) polynomial in *ri*, estimated separately for classes to the left and to the right of *ri* = 0, are the most common option (as in Fig. 3.4). By giving a larger weight to data points that are closer to the cutoff and allowing for a non-flat relationship between test scores and enrollment, this estimator makes it possible to enlarge the sample while maintaining internal validity. A flexible parametric regression of *Yi* on *ri* that uses all the available data could also be an option when sample size is small, but this may raise additional issues if high-order polynomials are adopted (see Gelman & Imbens, 2019).
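A minimal sketch of a kernel-weighted local linear estimator may help fix ideas. It is not the code used for Fig. 3.4: the triangular kernel, the bandwidth of 6, the simulated data, and all function names are assumptions made for illustration. A weighted linear regression is fitted separately on each side of the cutoff, and the estimate is the difference between the two fitted values at *ri* = 0.

```python
import random

def triangular_weight(r, h):
    """Triangular kernel: weight falls linearly from 1 at the cutoff to 0 at distance h."""
    return max(0.0, 1.0 - abs(r) / h)

def local_linear_at_zero(points, h):
    """Kernel-weighted least squares of y on (1, r); returns the fitted value at r = 0."""
    sw = swr = swrr = swy = swry = 0.0
    for r, y in points:
        w = triangular_weight(r, h)
        sw += w
        swr += w * r
        swrr += w * r * r
        swy += w * y
        swry += w * r * y
    det = sw * swrr - swr * swr
    return (swrr * swy - swr * swry) / det  # intercept of the weighted fit

def sharp_rd_estimate(rows, h):
    """Difference between the right and left boundary fits at the cutoff."""
    right = [(r, y) for r, y in rows if r >= 0]
    left = [(r, y) for r, y in rows if r < 0]
    return local_linear_at_zero(right, h) - local_linear_at_zero(left, h)

# Simulated data (hypothetical): outcome sloping in r, true jump of 0.5 at the cutoff.
rng = random.Random(1)
rows = []
for _ in range(20000):
    r = rng.uniform(-12, 12)
    y = 0.1 * r + 0.5 * (r >= 0) + rng.gauss(0, 1)
    rows.append((r, y))

print(f"local linear estimate: {sharp_rd_estimate(rows, h=6):.3f}")
```

Because the slope in enrollment is modeled on each side, observations up to six units away from the cutoff can be used without the bias that affects a simple comparison of means over the same window.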

#### **3.6.3.2 The Fuzzy RD Design**

When compliance with the Maimonides' rule is far from perfect, as in Italian primary schools, the sharp setting described in the previous section no longer applies. The fuzziness introduced by non-compliance can be dealt with using the class size predicted from the Maimonides' rule as an instrumental variable for the actual class size. The key assumption underlying this approach is that the regulation on class size formation must influence standardized tests only because the regulation affects how classes are eventually formed. This is, once again, an exclusion restriction of the form discussed in Sect. 3.5.3.3, above.

A few refinements of this idea are needed in this setting because the Maimonides' rule yields experimental-like variation only near *ri* = 0, implying that the "as good as random" condition in Sect. 3.5.3.2 must hold only *locally* with respect to this point. Complying classes here are those turning small because of compliance with the class size regulation when grade enrollment crosses from *ri* = 0<sup>−</sup> over to *ri* = 0<sup>+</sup> (see Sect. 3.5.3.1). Moreover, the first-stage condition, which ensures that the Maimonides' rule shapes, at least in part, the way classes in Italy are eventually formed, stems from the following contrast:

$$\lim\_{r \to 0^{+}} E\left(D\_{i} \mid r\_{i} = r\right) - \lim\_{r \to 0^{-}} E\left(D\_{i} \mid r\_{i} = r\right). \tag{3.9}$$

Eq. 3.9 compares the share of small classes just above and just below the Maimonides' cutoff *ri* = 0. Contrary to the case of a sharp RD, where this contrast is one because of full compliance, non-compliance makes this quantity lower than one, by an amount that depends on the share of complying classes. The more severe the extent of non-compliance, the lower the external validity of the causal conclusions, as we discussed in Sect. 3.5.5.

The same argument used in Sect. 3.5 extends to the case considered here and can be used to write:

$$E\left[Y\_i\left(1\right) - Y\_i\left(0\right) \mid C, \ r\_i = 0\right] = \frac{\lim\_{r \to 0^+} E\left(Y\_i \mid r\_i = r\right) - \lim\_{r \to 0^-} E\left(Y\_i \mid r\_i = r\right)}{\lim\_{r \to 0^+} E\left(D\_i \mid r\_i = r\right) - \lim\_{r \to 0^-} E\left(D\_i \mid r\_i = r\right)}.\tag{3.10}$$

 The expression in Eq. 3.10 reveals that a causal effect is retrieved by the ratio of the discontinuities in the outcome and in the treatment probability at the Maimonides' cutoff. This expression bears strong similarities with Eq. 3.4 above, once we assign the role played by the instrumental variable to a dummy for being above the Maimonides' cutoff, *Zi* = 1(*ri* ≥ 0). In fact, Hahn et al. (2001) showed that noncompliance leads the fuzzy RD design to be informative about a local average treatment effect, strengthening this similarity. However, the parameter uncovered by the fuzzy RD is local in two senses. First, it refers only to complying classes. Second, it yields causal conclusions only about classes with a value of *ri* close to 0, limiting external validity even further.
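The ratio in Eq. 3.10 can be sketched on simulated data. Everything in the example is an assumption made for illustration: the shares of classes that are always small (20%), complying (50%), and never small (30%), the constant effect of 0.5, the triangular kernel, and the function names. The sketch computes a simple ratio of two estimated discontinuities, not the TSLS regression with controls behind Table 3.6.

```python
import random

def boundary_fit(points, h):
    """Triangular-kernel weighted linear fit of v on r; returns the fitted value at r = 0."""
    sw = swr = swrr = swv = swrv = 0.0
    for r, v in points:
        w = max(0.0, 1.0 - abs(r) / h)
        sw += w
        swr += w * r
        swrr += w * r * r
        swv += w * v
        swrv += w * r * v
    return (swrr * swv - swr * swrv) / (sw * swrr - swr * swr)

def jump_at_cutoff(rs, vs, h):
    """Right-limit fit minus left-limit fit of v at r = 0."""
    right = [(r, v) for r, v in zip(rs, vs) if r >= 0]
    left = [(r, v) for r, v in zip(rs, vs) if r < 0]
    return boundary_fit(right, h) - boundary_fit(left, h)

def fuzzy_rd(rs, ds, ys, h):
    """Jump in the outcome divided by the jump in the share of small classes."""
    first_stage = jump_at_cutoff(rs, ds, h)   # sample analogue of Eq. 3.9
    reduced_form = jump_at_cutoff(rs, ys, h)
    return reduced_form / first_stage, first_stage

def simulate_fuzzy(n=60000, effect=0.5, seed=2):
    """Hypothetical classes with imperfect compliance with the rule Z = 1(r >= 0)."""
    rng = random.Random(seed)
    rs, ds, ys = [], [], []
    for _ in range(n):
        r = rng.uniform(-12, 12)
        z = 1 if r >= 0 else 0
        u = rng.random()
        if u < 0.2:
            d = 1            # always small, whatever the rule predicts
        elif u < 0.7:
            d = z            # complying class
        else:
            d = 0            # never small
        rs.append(r)
        ds.append(d)
        ys.append(0.1 * r + effect * d + rng.gauss(0, 1))
    return rs, ds, ys

rs, ds, ys = simulate_fuzzy()
late, first_stage = fuzzy_rd(rs, ds, ys, h=6)
print(f"first stage: {first_stage:.3f}, fuzzy RD estimate: {late:.3f}")
```

In this simulation, the outcome jumps by roughly half of the treatment effect at the cutoff because only about half of the classes change size there. Dividing by the first stage rescales the jump to the effect for complying classes.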

Following the analogy to the instrumental variable case, discussed in Sect. 3.5, estimation of fuzzy RD effects is usually carried out using two-stage least squares (TSLS) methods. The general idea is to instrument the treatment dummy *Di* with the dummy *Zi* = 1(*ri* ≥ 0). As in the sharp RD case, researchers can choose to model the relationship between test scores and enrollment using either parsimonious local regressions or flexible global polynomial regressions. In general, and unlike in the sharp RD case, a single TSLS regression is estimated using data on both sides of the cutoff but permitting the polynomial in *ri* to have a different shape on each side of the cutoff. This is done by including interaction terms between the polynomial in *ri* and *Di* that are instrumented by interaction terms between the polynomial in *ri* and *Zi*.<sup>13</sup>

The estimated fuzzy RD effects of class size on test scores for our running example are reported in Table 3.6 and show a negative and significant effect of class size on test scores for compliers at the relevant discontinuity cutoffs. For simplicity, these are obtained using continuous class size. For instance, according to the estimates reported in Column (1) of Panel A, when we consider data for the whole of Italy, we estimate that math scores would increase by an average of 0.06 standard deviations if we decreased class size by 1 unit. As revealed by Columns (2) and (3) and in accordance with Fig. 3.4, the magnitude of such effect is much larger in Southern Italy than in the rest of the country.

<sup>13</sup>Further details about estimation in the fuzzy RD design are discussed in Lee and Lemieux (2010a, b).


**Table 3.6** Local average treatment effect of class size on test scores (Angrist et al., 2017)

The table reports 2SLS estimates using class size cutoffs as an instrument. All models control for a quadratic in grade enrollment, segment dummies, and their interactions. The unit of observation is the class. Class size coefficients show the effect of 10 students. Robust standard errors, clustered on school and grade, are shown in parentheses. Control variables include % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and dummies for missing values. All regressions include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). a p<0.01, b p<0.05, c p<0.1

#### *3.6.4 Validating the Internal Validity of the Design*

An underlying assumption behind the approach discussed so far is that units cannot precisely manipulate their value of the running variable. For instance, suppose that parents of pupils with above-average ability could perfectly predict enrollment by school and choose to apply only for schools where enrollment is locally above the relevant cutoffs so that their pupils would systematically end up in smaller classes.<sup>14</sup> If this were the case, then the RD design would be invalid, as the ability composition of pupils in schools where enrollment is just above and just below the cutoff would be different.

In general, if units cannot precisely manipulate their value of the score, there should be no systematic differences in predetermined characteristics between units with similar values of the score. Therefore, a test for the internal validity of an RD design is to verify whether there are discontinuities in predetermined covariates at the cutoff. If predetermined variables that correlate with the outcome are discontinuous at the cutoff, then continuity of potential outcomes is unlikely to hold. These tests are akin to the "balancing" tests presented for the pure randomization case but are carried out locally, at the cutoff.
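The logic of such a local balance test can be sketched as follows. The example is hypothetical throughout: the covariates are simulated, the window of one unit on each side of the cutoff and the jump of 0.5 under sorting are arbitrary choices, and the test is a plain comparison of means rather than the regressions reported in Table 3.7.

```python
import math
import random
import statistics

def balance_t_stat(rs, xs, h):
    """t-statistic for the difference in covariate means just above vs. just below the cutoff."""
    above = [x for r, x in zip(rs, xs) if 0 <= r < h]
    below = [x for r, x in zip(rs, xs) if -h <= r < 0]
    diff = statistics.fmean(above) - statistics.fmean(below)
    se = math.sqrt(statistics.variance(above) / len(above)
                   + statistics.variance(below) / len(below))
    return diff / se

rng = random.Random(3)
rs = [rng.uniform(-12, 12) for _ in range(20000)]

# Predetermined covariate unrelated to the cutoff: no manipulation of the score.
balanced = [rng.gauss(0, 1) for _ in rs]

# Hypothetical sorting: families with a higher covariate value end up just above the cutoff.
sorted_x = [rng.gauss(0, 1) + (0.5 if r >= 0 else 0.0) for r in rs]

print(f"no sorting: t = {balance_t_stat(rs, balanced, h=1):.2f}")
print(f"sorting:    t = {balance_t_stat(rs, sorted_x, h=1):.2f}")
```

A small t-statistic for the first covariate is what we would expect under a valid design, whereas the large value for the second signals that units on the two sides of the cutoff differ even before treatment.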

Table 3.7 reports results for these tests and shows precisely estimated zero effects of passing the RD cutoffs on some predetermined controls, such as the share of students present in class on the day of the test, supporting the validity of this RD design.

<sup>14</sup>For instance, Urquiola and Verhoogen (2009) show evidence of discontinuities between enrollment and household characteristics in Chilean private schools.


**Table 3.7** Covariate balance for class size discontinuities (Angrist et al., 2017)

Columns 1, 3, and 5 show means and standard deviations for variables listed at the left. Other columns report coefficients from regressions of each variable on predicted class size, a quadratic in grade enrollment, segment dummies and their interactions, grade and year dummies, and sampling strata controls (grade enrollment at institution, region dummies, and their interactions). Standard deviations for the control group are in square brackets; robust standard errors are in parentheses. a p<0.01, b p<0.05, c p<0.1

#### **3.7 Conclusion**

This chapter has discussed a selected number of approaches among the most popular in the toolbox of good empiricists interested in causal relationships. Randomization, instrumental variation, and discontinuity designs are very closely related members of the same family and, when properly implemented, are thought to yield the most credible estimates of the causal effects of public interventions.

The beauty of randomized assignment is that the composition of "treatment" and "control" groups is by design not driven by any form of selection. In this case, differences in the composition of groups due to sampling variation tend to vanish as sample size increases so that the main concern should be the one of statistical validity. External validity and general equilibrium effects may also be a concern, especially if the intervention has to be implemented in different contexts or scaled up to cover a whole country.

Instrumental variation is a good way to go when randomized assignment is not viable. It seeks sources of random variation that have indirectly affected the chance of receiving "treatment." Clearly, a good source of variability must affect only the treatment assignment and, through this, the outcome of interest. Sources of external random variation affecting both treatment allocation and the outcome at the same time will not allow us to distinguish the effect of the instrument on the outcome from the effect of the treatment on the same outcome. As we have made clear, the price to pay for the lack of randomized assignment to treatment is external validity: estimates of causal effects obtained from instrumental variation are limited to the fraction of the population changing the treatment status because of the instrument. How large and comparable this fraction is with respect to the entire population is an empirical matter, which should be discussed on a case-by-case basis. We have discussed some tests for homogeneity of potential outcomes that allow us to extend validity to the whole population of interest.

**Fig. 3.5** Score manipulation by enrollment among second-grade students, centered at the RD cutoffs (Angrist et al., 2017). (Graphs plot residuals from a regression of test scores on the following controls: % female students, % immigrants, % fathers at least high school graduate, % employed mothers, % unemployed mothers, % mother NILF, grade and year dummies, and proportions of missing values in these variables. All regressions additionally include sampling strata controls (grade enrollment at institution, region dummies, and their interactions). The solid line shows a one-sided LLR fit.)

Finally, the idea of regression discontinuity is most easily put across by thinking of a properly conducted randomization only locally with respect to the discontinuity cutoff. The pros are clear-cut, and the cons concern the external validity of the estimates away from the relevant discontinuity.

What else could possibly go wrong? Books and chapters like this are always written to show a path forward for the implementation of methods. The day-to-day experience of a researcher is far more intricate. For example, Figure 3.5, taken from Angrist et al. (2017), casts doubt on the validity of the assumptions used in our discussion on the effects of class size. It shows that score manipulation also changes discontinuously at *ri* = 0 in Southern Italy, suggesting that teachers in small classes are more likely to manipulate scores. As a result, the alleged causal effect of class size on test scores in Southern Italy discussed above does not reflect more learning in smaller classes, but increased manipulation of scores in smaller classes. As discussed by Angrist et al. (2017), these findings show how class size effects can be misleading even where internal validity is probably not an issue.

This example should prompt the reader to weigh methods with a grain of salt and a proactive attitude: the most credible approach to causal inference is often a combination of different identification strategies, and its credibility must stem from the institutional context under investigation rather than clueless statistical assumptions.

#### **Review Questions**

1. Why is the naïve comparison of mean outcomes for treated and control subjects not always informative of a causal effect?


#### **Replication Material**

Access to data and codes is available from the American Economic Association website at: https://www.aeaweb.org/articles?id=10.1257/app.20160267

#### **References**


#### *Suggested Readings*


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 4 Correlation Is Not Causation, Yet… Matching and Weighting for Better Counterfactuals**

**Fedra Negri**

**Abstract** Anyone who has attended a statistics class has heard the old adage "correlation does not imply causation," usually followed by a series of hilarious graphs showing spurious correlations. Even if we strongly agree with it, this reminder has been taken a little too far: it is repeated like a mantra to criticize every observational study as being unable to detect causation behind statistical association. This chapter helps the reader go beyond the mantra, firstly, by explaining that "correlation does not imply causation" in observational studies because of selection bias (i.e. the composition of treatment and control groups follows a non-random selection) and parametric model dependence. Then, it introduces readers to weighting and matching techniques, smart statistical tools for reducing imbalance in the empirical distribution of pretreatment covariates between the treatment and control groups. Lastly, it provides an empirical illustration by focusing on two powerful algorithms: the entropy balancing (EB) and the coarsened exact matching (CEM). The chapter ends with caveats.

#### **Learning Objectives**

After studying this chapter, you should be able to:


F. Negri (\*)

University of Milan, Milan, Italy

University of Milan - Bicocca, Milan, Italy e-mail: fedra.negri@unimib.it


#### **4.1 Introduction**

The very first notion almost everyone learns in their introductory statistics classes is that "correlation does not imply causation." Usually, students are presented with several examples of spurious correlations to stress that just because two variables move in *tandem*, this does not necessarily signal a causal relationship between them. A typical example is the negative and statistically significant correlation between final college grades and the amount of time students spend studying (Atkinson et al., 1996), and a number of funny graphs are available online (see: www.tylervigen.com).

Let us put it clearly: we strongly agree that "correlation does not imply causation." However, we also think that in the everyday practice of statistics and especially statistics teaching, the message this sentence carries has been taken a little too far and beyond its scope. In fact, it is repeated like a mantra, to criticize every observational study as being unable to detect causation behind statistical association. The warning "correlation does not imply causation" has made many social scientists feel so uncomfortable with causal inference that they even try to avoid causal language (King et al., 1994: 75–76). Terms such as "effect" or "impact" and verbs such as "to determine" or "to shape" are routinely avoided in scientific publications and replaced by the calculatedly ambiguous "association" and "link" and "to increase/to decrease" (Hernán, 2018).

Here, two related points should be stressed. First, while "correlation does not imply causation" for sure, "causation *does* imply correlation": if two variables are causally related, a change in one has to trigger a change in the other (Cook & Campbell, 1979; Miles & Shevlin, 2001: 113). Second, even when a statistical association, such as a regression coefficient, supports our preexisting views, theoretical claims, or a scenario we wish to be true (the so-called confirmation bias), uncertainty about causal inference will never be completely eliminated in observational studies. Thus, a statistical association is a non-sufficient, but still necessary, condition to make a causal claim. This means that we should not give up. Rather, we should provide the reader with the best and most honest estimate of the uncertainty of our causal claims (King et al., 1994: 75–76).

The chapter is structured as follows. Section 4.2 explains why "correlation does not imply causation" in observational studies, i.e. because of selection bias and model dependence. Section 4.3 introduces the reader to matching procedures, smart statistical tools that adjust for composition to correct for selection bias due to observable characteristics (Chap. 3, Sects. 3.2.5 and 3.2.6, provides a more general discussion on selection bias given by unobservable factors). In detail, this section reviews and simplifies for the reader the latest contributions in the matching literature to emphasize both strengths and limitations of these techniques. Section 4.4 provides an application using the statistical software Stata by describing the algorithms developed by Hainmueller (2012) and Iacus et al. (2009, 2011, 2012, 2019). Some *caveats* complete the chapter.

#### **4.2 Not Just a Mantra: Correlation Is Not Causation Because…**

#### *4.2.1 Causal Inference Entails an Identification Problem*

Causal inference (i.e. the process by which we make claims about causal relationships) can be thought of as an identification problem. Informally, a parameter is identified in a model if it is theoretically possible to learn its true value with an infinite number of observations (Matzkin, 2007: section 3.1). An identification problem arises when we do not have enough information to learn the true value of that parameter even if the sample is infinite (Manski, 1995).

The potential outcomes framework (Rubin, 1974; Holland, 1986) formalizes the causal inference identification problem and labels it as the "fundamental problem of causal inference." As discussed at length in Chap. 3 (see Sects. 3.2.2 and 3.2.3 for details), in the potential outcomes framework, each unit *i* has two potential outcomes, *Yi*(1) if unit *i* is treated (*Di* = 1) and *Yi*(0) if unit *i* is untreated (*Di* = 0), but only one actual outcome, which depends on the actual treatment that unit *i* receives. Thus, the unit-level treatment effect, *Δi* = *Yi*(1) − *Yi*(0), is impossible to estimate because one of the two potential outcomes cannot be identified for each unit: for treated units, we observe *Yi* = *Yi*(1) only; for control units, we observe *Yi* = *Yi*(0) only.

Usually, we focus on the average treatment effect (ATE), which is the difference in the pair of potential outcomes averaged over the entire population of interest: *ATE* = *E*(*Yi*(1) − *Yi*(0)). Frequently, the ATE is defined for the subpopulation exposed to the treatment, the average treatment effect for the treated (ATT): *ATT* = *E*(*Yi*(1) − *Yi*(0)| *Di* = 1). Analogously, the average treatment effect for the non-treated (ATNT) is given by: *ATNT* = *E*(*Yi*(1) − *Yi*(0)| *Di* = 0).

However, moving from the unit-level treatment effect to the average treatment effects for the treated (ATT) or the non-treated (ATNT) does not solve our initial causal inference identification problem. Indeed, as regards the ATT, no additional amount of data will allow us to observe the average outcome under control for those units in the treatment condition, *E*(*Yi*(0)| *Di* = 1). Moving to the ATNT, no additional amount of data will allow us to observe the average outcome under treatment for those units in the control condition, *E*(*Yi*(1)| *Di* = 0). The advanced reader may find a more formalized discussion in Keele (2015: 314–318).
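A simulation, in which both potential outcomes are known by construction, can illustrate these quantities and why the observed data alone do not deliver them. The sketch below rests on assumptions invented for the example: an unobserved "ability" variable that raises both the outcome and the chance of treatment, a heterogeneous effect of 1 + 0.5 × ability, and hypothetical function and variable names.

```python
import random
import statistics

def simulate_effects(n=50000, seed=4):
    """Simulate both potential outcomes for each unit and summarize treatment effects."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        ability = rng.gauss(0, 1)
        y0 = ability + rng.gauss(0, 1)                  # outcome if untreated
        y1 = y0 + 1.0 + 0.5 * ability                   # outcome if treated
        d = 1 if ability + rng.gauss(0, 1) > 0 else 0   # non-random selection into treatment
        units.append((d, y0, y1))
    treated = [(y0, y1) for d, y0, y1 in units if d == 1]
    control = [(y0, y1) for d, y0, y1 in units if d == 0]
    mean = statistics.fmean
    return {
        "share_treated": len(treated) / n,
        "ATE": mean([y1 - y0 for _, y0, y1 in units]),
        "ATT": mean([y1 - y0 for y0, y1 in treated]),
        "ATNT": mean([y1 - y0 for y0, y1 in control]),
        # The naive comparison uses only what is observed: Y(1) for treated, Y(0) for controls.
        "naive": mean([y1 for _, y1 in treated]) - mean([y0 for y0, _ in control]),
        # E(Y(0) | D = 1) is never observed in real data; it is available here by construction.
        "selection_bias": mean([y0 for y0, _ in treated]) - mean([y0 for y0, _ in control]),
    }

results = simulate_effects()
for name, value in results.items():
    print(f"{name:>14}: {value:.3f}")
```

In the simulated sample, the naive comparison equals the ATT plus the selection bias term, and the ATE is the average of ATT and ATNT weighted by the shares of treated and control units. With real data, the terms involving the unobserved potential outcome cannot be computed, which is the identification problem in a nutshell.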

Thus, the potential outcomes framework helps us in understanding that causal inference entails an unavoidable identification problem. Since no additional data can help us in solving this problem, we need to find a credible identification strategy.

#### *4.2.2 Each Identification Strategy Entails a Set of Assumptions*

An identification strategy is a research design and entails a set of assumptions, whose plausibility critically depends on the empirical context and should be discussed on a case-by-case basis (Angrist & Pischke, 2009; Morgan & Winship, 2014). The plausibility of some assumptions is testable. Think, for example, of the degree of compliance with the treatment assignment in a randomized experiment or of the first-stage requirement in a natural experiment with instrumental variation (see Chap. 3, Sect. 3.5.3.4, for details). Unfortunately, this is not always the case: untestable assumptions are unavoidable in causal inference. This is why reasoning about the plausibility of the assumptions entailed by the research design the researcher has chosen is a crucial preliminary step for social scientists aiming at detecting causal effects. This step precedes data collection and statistical analysis and often involves qualitative information about the institutional and empirical context (Keele, 2015: 323–324).

In what follows, we summarize the assumptions needed for statistical estimates to be given a causal interpretation under different research designs. Chapter 3 has already described three common research designs: randomized experiments, where treatment assignment is random, and quasi-experiments providing convincing substitutes to randomization, namely, instrumental variation and regression discontinuity designs (see Chap. 3, Sect. 3.5 and 3.6, for details).

Ideally, randomized experiments can achieve valid and relatively straightforward causal inferences if three requirements are met: (1) random selection of units to be observed from a given population, (2) random assignment of values of the treatment to each observed unit, and (3) large sample size. Random selection (1) avoids selection bias by guaranteeing that the probability of selection from a given population is related to the potential outcomes only by random chance. Combining random selection (1) with large sample size (3) guarantees that the chance that something will go wrong is extremely small. Random assignment (2) guarantees the absence of omitted variable bias even without any control variables included. Here, too, random assignment (2) plus large sample size (3) minimizes the chance of omitted variable bias (Ho et al., 2007: 205–206; see also Chap. 3, Sect. 3.4, for details).
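The role of random assignment and sample size can be illustrated with a short simulation. The setup is hypothetical: the "ability" variable, the constant effect of 1.0, the assignment probability of one half, and the function name are all assumptions made for the example.

```python
import random
import statistics

def randomized_experiment(n, seed):
    """Difference in mean outcomes when treatment is assigned by a fair coin flip."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.gauss(0, 1)        # affects the outcome but not the assignment
        y0 = ability + rng.gauss(0, 1)
        y1 = y0 + 1.0                    # true effect is 1.0 for every unit
        if rng.random() < 0.5:
            treated.append(y1)
        else:
            control.append(y0)
    return statistics.fmean(treated) - statistics.fmean(control)

for n in (100, 10000, 200000):
    print(f"n = {n:>6}: estimate = {randomized_experiment(n, seed=5):.3f}")
```

Because assignment is unrelated to ability, the two groups differ only by chance, and the estimate tends to settle near the true effect as the sample grows. With small samples, chance imbalances can still move it away from 1.0.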

However, social science research usually uses observational data that do not meet all three requirements. For example, survey research guarantees a large sample size (3), but it is becoming more and more difficult to randomly select respondents due to increasing nonresponse rates (1), and it is almost impossible to fulfil the random assignment requirement (2).

When dealing with observational data, a key further assumption is needed for statistical estimates to be given a causal interpretation: the so-called "selection on observables" assumption (Barnow et al., 1980; Heckman & Robb, 1985). Informally, the researcher has to assume that there is a set of covariates *Xi* such that treatment assignment *Di* is random conditional on these covariates. This assumption is nonrefutable because it cannot be verified with observed data (Manski, 2007).

This assumption has a number of different names. In econometrics, it is also known as "no omitted variable bias," to emphasize that the model specification must include all the variables that are causally prior to the treatment assignment *Di*, that are empirically related to *Di*, and that affect the observed potential outcome *Yi*, conditional on *Di* (Goldberger, 1991; King et al., 1994: 76–82). Remember that only random assignment guarantees that *Di* is independent of any *Xi*, whether measured or not, except by random chance (see Chap. 3, Sect. 3.4).

In statistics, the same assumption is known as "ignorability," to underline that the treatment assignment *Di* and the unobserved potential outcomes are independent after conditioning on a set of covariates *Xi* and the observed potential outcomes so that there are no unobserved factors capable of biasing our estimates (Rubin, 1978). Alternative labels are the "absence of unmeasured confounding" or "conditional independence assumption."

Whatever the name, "selection on observables is a very strong assumption [...]. Generally, selection on observables needs to be combined with a number of different design elements before it becomes credible" (Keele, 2015: 322). Indeed, even admitting that the researcher has in mind the list of "correct" covariates to be incorporated in the model specification to meet this assumption, (1) additional data collection may be expensive and onerous, and (2) long model specifications increase the risk of over-controlling or of including bad controls (Angrist & Pischke, 2009: 69). Problem (2) arises when we include posttreatment covariates in the model specification. In an experimental setting, it is quite easy to tell pretreatment from posttreatment covariates. With observational data, things get harder. Think, for example, about the items of a survey: if we exclude respondents' exogenous characteristics such as age, gender, citizenship, or parental level of education, it may be hard to state for sure that a covariate is "truly" pretreatment and thus not a consequence of *Di*. Note that a further complication, known as the "M-bias" (Pearl, 2009a, b), will be discussed at length in Chap. 6.

This section aims to make it clear that there is no easy way out and there is no magic. The identification problem cannot be solved by simply looking at data. Rather, we need to resort to identification strategies, and each of them rests on a series of assumptions. When the data are observational, a very strong assumption is added to the list: the "selection on observables" one. This is the reason why "correlation [per se] does not imply causation." However, this is not the end of the story: selection on observables can be combined with statistical tools to boost its credibility (Keele, 2015).

#### *4.2.3 Last but not Least: Model Dependence*

Of course, any specific statistical tool we choose to boost the credibility of our identification strategy will make additional assumptions (Ho et al., 2007: 2010–2011).

Let us be honest: as social and political scientists, we usually spend a considerable amount of time in collecting, merging, cleaning, and recoding raw data. Then, we finally load our data set into our favorite statistical software and run several model specifications by using the parametric statistical technique that best fits our data (e.g., OLS, discrete choice models, duration models, etc.).

The main problem with this common procedure is that all parametric methods assume that we know the "right" model specification before looking at the estimates. A model is "right" if it is (a really good approximation to) the data-generating process. Otherwise, the model will miss important aspects of reality and inference will be systematically wrong or overly precise.

Instead, what happens in everyday research is that we start from a generic model specification suggested by our theoretical framework, previous works, or common sense, and then, we modify it by adding or removing control variables and interaction terms, changing the operationalization of some variables or the functional form, restricting the sample, etc.

Following this inductive procedure, we end up with several alternative estimates of the statistical relationship between our variable of interest and the dependent variable. However, to improve readability, we typically choose no more than ten model specifications to be included in our written work. This choice, made after looking at the estimates, entails methodological and ethical dilemmas. Moreover, it forces us to convince the readers (and the reviewers) that we have picked the "right" specifications, not simply the ones that best support our starting hypotheses.

Thus, even if rarely admitted, correlation also does not imply causation in observational studies because effect estimates may be model dependent, at least to some degree (Ho et al., 2007).

#### **4.3 Preprocessing Data with Matching to Improve the Credibility of the Estimates**

Imagine we want to estimate the effect of a policy in situations when controlled randomization is unfeasible, unethical, or politically sensitive and there are no convincing natural experiments providing a substitute for randomization such as the ones described in Chap. 3, Sects. 3.5 and 3.6 (i.e., instrumental variation and RDD). In these situations, matching may be a powerful non-parametric technique for boosting the credibility of the estimates. It is grounded on the idea that some serious statistical problems (i.e. model dependence, estimation error, and bias) can be downplayed by dropping heterogeneous observations from the raw data and thus limiting inferences to a carefully selected subsample.

#### *4.3.1 No Magic: What Matching Can and Cannot Do*

Before addressing any technicality, we want to stress a key point about matching. It is not a method of estimation of causal effects, it is "only" a non-parametric statistical tool for preprocessing raw data so that the treatment group becomes as similar as possible to the control group on a set of covariates chosen by the researcher (Arceneaux et al., 2006; Sekhon, 2009). Once treated units have been matched with control ones according to one among the available matching procedures, some method of estimation is needed to obtain an estimate of the causal effect. If the treatment and control groups are exactly balanced on the set of covariates chosen by the researcher (i.e. if the treatment and control covariate distributions are the same), then the method of estimation can credibly be a simple difference in means between the outcomes of the two groups. However, if the two groups are not exactly balanced (i.e. if there are still systematic differences between them, as usually happens), then the researcher has to further adjust the matched sample by using the parametric model they would have used anyway (e.g., Ho et al., 2007; Iacus et al., 2019). Thus, matching is just a convincing way to select the observations on which some methods of estimation should be later applied (with their own additional assumptions).

Exactly as when we interpret the coefficient of a multivariate regression model as a causal effect, matching procedures are grounded on the strong assumption of selection on observables. This means that it should be theoretically plausible that selection into treatment is completely determined by a set of covariates *Xi* that can be observed by the researcher such that, conditioning on *Xi*, the assignment to treatment is as good as random. To put it differently, it should be theoretically plausible that there are no additional unobservable variables capable of pushing units into treatment.<sup>1</sup>

<sup>1</sup>Given that both matching and regression are based on the selection on observables assumption, the reader may wonder whether matching is really different from a regression with properly identified control variables. This question is the object of a heated debate among methodologists. Some maintain that both regression and matching are control strategies, and therefore, the differences between the two are unlikely to be of major empirical importance (Angrist & Pischke, 2009: section 3.3.1). Others have pointed out shortcomings of regression relative to matching: Dehejia and Wahba (1999), for example, found that propensity score matching procedures more closely approximate the results from a randomized experiment than regression alone. Further, some have underlined that regression is a parametric approach imposing a global linear relationship between Xs and Y and that it uses all the available observations, thereby involving a certain amount of extrapolation, while matching is a non-parametric approach that discards observations for which a reasonably close match cannot be found (Martini & Sisti, 2009: 221–225). Others have stated that matching involves several choices in its implementation, which could lead to subjectivity in the results. According to Imbens and Wooldridge, "the best practice is to combine linear regression with either propensity score or matching methods" (2008: 19–20) as in this way, the estimated effect will explicitly rely on local, rather than global, linear approximations to the regression function. Even though adjudicating between these views is beyond the scope of this chapter, the application discussed in Sect. 4.4 embraces this last suggestion and thus combines the CEM algorithm with OLS regression.

However, compared to regression, preprocessing raw data with matching eliminates, or at least reduces, the selection bias due to the set of covariates chosen by the researcher, which renders any subsequent parametric adjustment either irrelevant (if balance is fully achieved) or less important (if balance is partially achieved). To put it simply, given the plausibility of the selection on observables assumption, preprocessing data with matching makes causal effect estimates based on the subsequent parametric analyses far less dependent on modeling choices and specifications. Quoting Ho et al. (2007: 233): "Analysts using preprocessing have two chances to get their analyses right, in that if either the matching procedure or the subsequent parametric analysis is specified correctly (and even if one of the two is incorrectly specified), causal estimates will still be consistent" (on this, see also Robins & Rotnitzky, 2001). Moreover, it has been proved that when matching is applied carefully so that *n* is not much smaller in the matched sample than in the original sample, it leads to a reduction in both bias and variance of estimates from subsequent parametric analyses (Rubin & Thomas, 1996; Imai & van Dyk, 2004).

#### *4.3.2 Useful Starting Point: Exact Matching*

Let us formalize the selection on observables assumption. Remember that we aim to estimate the average treatment effect for the treated: *ATT* = *E*(*Yi*(1) − *Yi*(0)| *Di* = 1). Unfortunately, we do not observe the average outcome under control for those units in the treatment condition, *E*(*Yi*(0)| *Di* = 1). Instead, we observe the average outcome under control for those units in the control condition, *E*(*Yi*(0)| *Di* = 0). As discussed in Chap. 3, Sect. 3.2.3, a naive comparison of outcomes by treatment status provides a biased estimate of the ATT:

$$\begin{aligned} E\left(Y_i(1) \mid D_i = 1\right) - E\left(Y_i(0) \mid D_i = 0\right) &= E\left(Y_i(1) - Y_i(0) \mid D_i = 1\right) \\ &\quad + \left[ E\left(Y_i(0) \mid D_i = 1\right) - E\left(Y_i(0) \mid D_i = 0\right) \right] \end{aligned}$$

The first term on the right-hand side of the equation is the ATT (the quantity we are interested in); the second term is the sample selection bias that accounts for the differences in outcome under control between treated and control units. We already know that only if the three requirements of an ideal RCT are met (i.e. (1) random selection, (2) random treatment assignment, and (3) large sample size), the sample selection bias is zero, and thus, the naive comparison of outcomes by treatment status provides an unbiased estimate of the ATT.
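The decomposition can be checked by hand on a toy data set. The sketch below is only illustrative: the six units and their potential outcomes are invented, and both potential outcomes are listed for every unit, which is never possible with real data.

```python
from statistics import mean

# Hypothetical units as (D, Y0, Y1). Both potential outcomes are shown only
# because this is a toy example; real data reveal just one of them per unit.
units = [
    (1, 10.0, 14.0),
    (1, 12.0, 15.0),
    (1, 11.0, 16.0),
    (0, 6.0, 9.0),
    (0, 7.0, 11.0),
    (0, 8.0, 10.0),
]

treated = [u for u in units if u[0] == 1]
control = [u for u in units if u[0] == 0]

# Naive comparison: observed outcome of the treated minus observed outcome of the controls.
naive = mean(y1 for _, _, y1 in treated) - mean(y0 for _, y0, _ in control)

# ATT: average unit-level effect among the treated.
att = mean(y1 - y0 for _, y0, y1 in treated)

# Selection bias: difference in outcomes under control between the two groups.
selection_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in control)

print(naive, att, selection_bias)
```

In this invented example, the treated units would have done better than the controls even without treatment, so the naive comparison overstates the ATT by exactly the selection bias term.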

Now, let *Xi* be a set of pretreatment covariates. The selection of the set of covariates *Xi* by the researcher is a critical step. According to the usual rules for avoiding omitted variable bias, *Xi* should include all variables that affect both the treatment assignment *Di* and, controlling for the treatment, the dependent variable *Yi* (this does not mean that every available pretreatment variable should be included in *Xi* because it will reduce efficiency). However, to avoid a "posttreatment bias" (King & Zeng, 2007), variables that may be even remotely consequences of the treatment variable should never be included in *Xi* (Cox, 1958: section 4.2; Rosenbaum, 1984; Rosenbaum, 2002: 73–4).

According to the selection on observables assumption, once we condition on *Xi*, assignment to treatment *Di* is independent from the unobserved potential outcomes *Yi*(0) and *Yi*(1):

$$Y_i(1), Y_i(0) \perp D_i \mid X_i$$

Under this assumption, conditioning on *Xi*, the average outcome under control for those units in the control condition is equal to the average outcome under control for those units in the treatment condition:

$$E\left(Y_i(0) \mid D_i = 0, X_i\right) = E\left(Y_i(0) \mid D_i = 1, X_i\right) = E\left(Y_i(0) \mid X_i\right)$$

Similarly, conditioning on *Xi*, the average outcome under treatment for those units in the control condition is equal to the average outcome under treatment for those units in the treatment condition:

$$E\left(Y_i(1) \mid D_i = 0, X_i\right) = E\left(Y_i(1) \mid D_i = 1, X_i\right) = E\left(Y_i(1) \mid X_i\right)$$

Thus, the expected value of *Yi* is independent from *Di*, given *Xi*. Using the Law of Iterated Expectations, the ATT is given by:

$$\begin{split} ATT &= E\left[Y_i(1) - Y_i(0) \mid D_i = 1\right] = E\left[ E\left[Y_i(1) - Y_i(0) \mid D_i = 1, X_i\right] \,\middle|\, D_i = 1\right] \\ &= E\left[ E\left[Y_i(1) \mid D_i = 1, X_i\right] - E\left[Y_i(0) \mid D_i = 1, X_i\right] \,\middle|\, D_i = 1\right] \end{split}$$

The term *E* [ *Yi*(0)| *Di* = 1, *Xi*] is counterfactual, but under the selection on observables assumption, we have:

$$ATT = E\left[ E\left[Y_i(1) \mid D_i = 1, X_i\right] - E\left[Y_i(0) \mid D_i = 0, X_i\right] \,\middle|\, D_i = 1\right]$$

We can rewrite it as:

$$ATT = E\left[\delta_x \mid D_i = 1\right],$$

where *δx* is the difference in means by treatment status at each value of *Xi*.

$$\delta_x = E\left[Y_i(1) \mid D_i = 1, X_i\right] - E\left[Y_i(0) \mid D_i = 0, X_i\right]$$

This is the identification strategy employed by the so-called "exact matching." Informally, it suggests preprocessing the data so that each treated unit is matched with all the available control units that have exactly the same covariate values (do not confuse exact matching with one-to-one exact matching, which is more limited because it uses only one control unit for each treated unit). If, after exact matching, a large number of treated units are exactly matched with one or more control units, then we have an exact balance with little inefficiency. This means that a (weighted) difference between the average outcomes of matched treated and control units is sufficient to obtain an unbiased estimate of the ATT. We added "weighted" in parentheses because, since each treated unit can be matched with more than one control unit, a weighted difference in means across exactly matched subclasses is suggested to account for the difference in the number of treated and control units. Beware that if some treated units cannot be matched because there is not at least one control unit with exactly the same covariate values, the exact matching procedure drops these treated units. By dropping some treated units, we alter the *estimand*: it is no longer the ATT, but a more local version of it (Crump et al., 2009; Rubin, 2010). As discussed in Chap. 3, Sect. 3.3.3, this may weaken the external validity of the estimates. This choice is reasonable as long as the researcher is transparent about it and its consequences in terms of the new set of treated units over which the causal effect is defined (Iacus et al., 2012: 5).
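A minimal sketch of this procedure may help fix ideas. The function name `exact_matching_att`, the data layout, and the toy rows are our own assumptions, not taken from any package: units are grouped by their exact covariate profile, *δx* is computed within each profile that contains both treated and control units, and the *δx* are averaged with weights equal to the number of treated units in each profile.

```python
from collections import defaultdict

def exact_matching_att(rows):
    """Exact matching on a covariate profile (illustrative sketch).

    rows: iterable of (x, d, y), where x is a hashable covariate profile,
    d is 1 for treated and 0 for control, and y is the observed outcome.
    Returns (estimate, matched treated units, dropped treated units).
    """
    strata = defaultdict(lambda: {1: [], 0: []})
    for x, d, y in rows:
        strata[x][d].append(y)

    weighted_sum, n_matched, n_dropped = 0.0, 0, 0
    for cell in strata.values():
        n_treated = len(cell[1])
        if n_treated == 0:
            continue                      # only controls here: nothing to estimate
        if not cell[0]:
            n_dropped += n_treated        # no exact control match: treated units dropped
            continue
        delta_x = sum(cell[1]) / n_treated - sum(cell[0]) / len(cell[0])
        weighted_sum += n_treated * delta_x
        n_matched += n_treated

    if n_matched == 0:
        raise ValueError("no treated unit has an exact control match")
    return weighted_sum / n_matched, n_matched, n_dropped

# Hypothetical data: profile = (gender, education level)
rows = [
    (("f", "low"), 1, 10.0), (("f", "low"), 1, 12.0), (("f", "low"), 0, 8.0),
    (("m", "high"), 1, 20.0), (("m", "high"), 0, 15.0), (("m", "high"), 0, 17.0),
    (("m", "low"), 1, 9.0),    # treated unit without any exact control match
    (("f", "high"), 0, 30.0),  # control unit without any treated counterpart
]
att_hat, n_matched, n_dropped = exact_matching_att(rows)
print(att_hat, n_matched, n_dropped)
```

Because one treated unit is dropped in this toy example, the returned quantity is the effect for the three matched treated units only, that is, the more local estimand discussed above.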

If an insufficient number of exact matches are found, and thus, many treated units have to be discarded, the researcher has to switch to other matching procedures that preprocess the data so that each treated unit is matched with all the available control units that have approximately the same covariate values.

#### *4.3.3 Propensity Score Tautology*

The best practice for approximate matching procedures involves two steps. The first step drops treated and control units outside the so-called "common support" of both groups. Informally, the common support assumption requires that for any treated unit with given covariate values, it is also possible to observe a control unit with the same (or approximately the same) covariate values. Thus, ensuring common support requires the researcher to drop observations where the empirical density of treated and control units does not overlap since including these observations would require extrapolation from the data, which can generate considerable model dependence.

To accomplish this first step, King and Zeng (2007) suggest pruning observations from the control group that are outside of the "convex hull" of the treatment group. Informally, with one pretreatment covariate *X*, the convex hull of the treatment group is the range of the subset of observations of *X* that are in the treatment group so that control units with values of *X* greater than *max*(*X*|*Di* = 1) or lower than *min*(*X*|*Di* = 1) are discarded. Similarly, if any treated units fall outside the convex hull of the control units, these are also discarded (see also Iacus & Porro, 2009 for another conservative way of identifying common support). Remember once more that dropping treated units changes the *estimand*: it is no longer the ATT, but a more local version of it.
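With a single covariate, the pruning rule reduces to comparing ranges, as the following sketch shows. The function name and the ages are hypothetical, and with several covariates the convex hull is no longer a simple interval, so this covers only the one-dimensional case described above.

```python
def prune_to_common_support(x_treated, x_control):
    """One-covariate common support: keep units inside the overlap of the two ranges."""
    lower = max(min(x_treated), min(x_control))
    upper = min(max(x_treated), max(x_control))
    kept_treated = [x for x in x_treated if lower <= x <= upper]
    kept_control = [x for x in x_control if lower <= x <= upper]
    return kept_treated, kept_control

# Hypothetical ages
age_treated = [22, 30, 41, 58]
age_control = [25, 27, 35, 44, 52, 63]

kept_treated, kept_control = prune_to_common_support(age_treated, age_control)
print(kept_treated)  # the treated unit aged 22 lies below every control and is dropped
print(kept_control)  # the control unit aged 63 lies above every treated unit and is dropped
```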

The second step matches treated units with control units so that they are as close as possible according to some metric. However, as anticipated, establishing on which dimensions the degree of closeness between treated and control units has to be evaluated (i.e. selecting the pretreatment covariates to be included into *Xi*) is not easy: the researcher might be willing to include a large set of covariates, many of them multivalued or continuous. This problem is known as "the curse of dimensionality."

Rosenbaum and Rubin (1983) addressed this problem by developing a matching procedure based on the propensity score, defined as the conditional probability of receiving the treatment given the pretreatment covariates selected by the researcher. They start from the usual selection on observables assumption: once we condition on *Xi*, the average potential outcome under control for those units in the treatment condition should be equal to the average potential outcome under control for those units in the control condition. Thus, once we condition on *Xi*, the average potential outcome under control should be the same irrespective of the treatment condition:

$$E\left(Y_i(0) \mid D_i = 1, X_i\right) = E\left(Y_i(0) \mid D_i = 0, X_i\right) = E\left(Y_i(0) \mid X_i\right)$$

They move on by demonstrating that if potential outcomes are independent of treatment status conditional on the set of covariates *Xi*, then potential outcomes are also independent of treatment status conditional on a scalar function of the same covariates *Xi*, labelled "propensity score." They collapsed the set of covariates *Xi* into a one-dimensional variable that measures, for each unit *i*, the probability of receiving treatment given the values of its set of covariates *Xi*, *P*(*Xi*) = *P*(*Di* = 1| *Xi*). Usually, it is estimated through a logit or a probit model, which regresses *Di* on a constant term and the set of covariates *Xi* chosen by the researcher, without looking at *Yi*. Conditioning on the propensity score, the average potential outcome under control is again the same irrespective of the treatment condition:

$$E\left(Y_i(0) \mid D_i = 1, P(X_i)\right) = E\left(Y_i(0) \mid D_i = 0, P(X_i)\right) = E\left(Y_i(0) \mid P(X_i)\right)$$

Approximate matching methods based on the propensity score tend to skip the first step and to check for common support only after having estimated the propensity score for each observation *i*. Indeed, they drop control units that have a propensity score lower than the minimum or higher than the maximum of the propensity score of the treated units (Khandker et al., 2010).
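The two operations just described, estimating the propensity score and trimming control units outside the range of the treated units' scores, can be sketched as follows. This is an illustration under strong simplifications rather than a recommended implementation: there is a single invented covariate, the logit is fitted with a hand-written Newton-Raphson routine (`fit_logit` is our own helper, not a library function), and in applied work one would rely on the logit or probit routines of standard statistical software.

```python
import math

def fit_logit(x, d, n_iter=25):
    """Maximum likelihood fit of P(D=1|x) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson."""
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b)
        g0 = sum(di - pi for di, pi in zip(d, p))
        g1 = sum((di - pi) * xi for di, pi, xi in zip(d, p, x))
        # Entries of the (negative) Hessian
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Hypothetical data: one pretreatment covariate and a binary treatment
x = [1, 2, 3, 4, 5, 6, 7, 8]
d = [0, 0, 1, 0, 1, 0, 1, 1]

a_hat, b_hat = fit_logit(x, d)
pscore = [1.0 / (1.0 + math.exp(-(a_hat + b_hat * xi))) for xi in x]

# Common support check: keep controls whose score lies within the treated range
treated_ps = [p for p, di in zip(pscore, d) if di == 1]
ps_min, ps_max = min(treated_ps), max(treated_ps)
kept_control_x = [xi for xi, di, p in zip(x, d, pscore)
                  if di == 0 and ps_min <= p <= ps_max]
print(kept_control_x)
```

Note that the outcome *Yi* never enters the code: the propensity score is estimated from the treatment indicator and the covariates alone.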

However, the reader may have already realized that the propensity score solution by Rosenbaum and Rubin (1983) is a tautology. The propensity score has been developed to solve the curse of dimensionality problem (i.e. too many dimensions to be controlled for to match treated and control units). However, since we do not know the "true" propensity score, it has to be estimated through a probability model that adds the same dimensions as independent variables. Moreover, the only way to check the validity of the specification of the estimated propensity score (i.e. to check whether the estimated propensity score is a consistent estimate of the "true" propensity score) is to stratify the sample over small propensity score intervals and then, for each covariate in each interval, test whether the means of the treated and control units are not statistically different. If they are statistically different, the researcher has to improve the specification of the *probit* or *logit* function he/she used to estimate the propensity score and start again (Dehejia & Wahba, 1999; Becker & Ichino, 2002). Unfortunately, there is no way out from the propensity score tautology: "[I]t works when it works [when matching on the propensity score balances the raw covariates], and when it does not work, it does not work (and when it does not work, keep working at it)" (Ho et al., 2007: 219).

#### *4.3.4 How to Choose Among Matching Procedures?*

Once the researcher has estimated the propensity score for each unit *i*, they have to choose a metric to match treated and control units. Several metrics are available: they vary in the strategy they follow to select the matches and in the weight they associate with each match. Table 4.1 lists the most widely used approximate matching procedures based on the propensity score and provides references for further readings (see also Caliendo & Kopeinig, 2008).

**Table 4.1** Commonest approximate matching techniques based on the propensity score

Given this long and non-exhaustive list of approximate matching procedures, how can we choose among them? The methodological literature does not provide a clear-cut answer. Since the main diagnostics of success in matching are balance (i.e. the degree to which the treatment and the control group covariate distributions resemble each other) and the number of observations remaining after preprocessing the raw data, a rule of thumb is to preprocess raw data by running as many approximate matching procedures as possible. To avoid any confirmation bias, it is crucial that the researcher performs this comparison without consulting *Y*. Then, they have to choose the procedure that maximizes balance while keeping *n* as large as possible (Ho et al., 2007). As the reader may have foreseen, this search for the matching procedure that maximizes balance and the number of observations may be tedious as the researcher has to manually iterate between the available algorithms (Ho et al., 2007; Iacus et al., 2009; Hainmueller, 2012; King & Nielsen, 2019). Section 4.4 describes two techniques that address this problem.

To assess balance, Ho et al. (2007: 221) suggest the following options: first, comparing the mean of each variable *Xi* in the treatment group with the mean of each variable in the control group (if one or more of these means differ by more than a quarter of a standard deviation of the respective *Xi* variable, a better balance is needed) (Cochran, 1968); second, comparing treatment and control histograms one variable at a time; third, using a quantile–quantile plot (QQ plot) for each variable to compare the full empirical distributions of each variable for the treatment and control groups; and lastly, the same QQ plot can be used for the propensity scores of the treatment and control groups. Even if tautological (it relies on the propensity score as a summary of the data to check whether the chosen propensity score matching is adequate), it may be a good low-dimensional summary (Ho et al., 2007: 221–223; see also Rubin, 2001; Austin & Mamdani, 2006; Imai et al., 2008).
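The first of these checks is easy to automate. In the sketch below, the covariate names and values are invented, and we assume that "the standard deviation of the respective variable" refers to the standard deviation computed over treated and control units together; other conventions (for example, the standard deviation in the treatment group only) are also used in practice.

```python
from statistics import mean, stdev

def standardized_difference(x_treated, x_control):
    """Difference in means divided by the SD of the covariate over all units (assumed convention)."""
    sd_all = stdev(list(x_treated) + list(x_control))
    return (mean(x_treated) - mean(x_control)) / sd_all

# Hypothetical matched sample: covariate -> (treated values, control values)
covariates = {
    "age": ([25, 30, 35, 40], [45, 50, 55, 60]),
    "educ": ([10, 12, 12, 14], [11, 12, 12, 13]),
}

# A covariate passes the check if the means differ by at most a quarter of an SD
balanced = {
    name: abs(standardized_difference(t, c)) <= 0.25
    for name, (t, c) in covariates.items()
}
print(balanced)
```

Here the means of `educ` coincide, whereas `age` fails the quarter-of-a-standard-deviation rule, so a better balance would be needed before moving on. The check never looks at the outcome.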

One might object that increasing balance by throwing away unmatched observations will reduce statistical efficiency (i.e. the mean squared error of the estimated effect might increase). However, "efficiency should be a secondary concern for observational students" (Keele, 2015: 325). In a randomized experiment, where selection bias is known to be zero, adding observations simply increases power. On the other hand, in an observational study, increasing the sample size may shrink the confidence intervals to a point that excludes the "true" treatment effect point estimate (Cochran & Chambers, 1965). Moreover, Rosenbaum (2004, 2005) demonstrated that in observational studies, reducing unit heterogeneity reduces both sampling variability and sensitivity to bias from unobserved covariates. Thus, as a rule of thumb, there are reasons for preprocessing raw data through matching procedures in order to reduce heterogeneity between the treatment and control groups according to a set of observable covariates (for theoretical and simulation results, see also Rubin & Thomas, 1992, 1996; Imai & Van Dyk, 2004; Imbens, 2004; Morgan & Winship, 2014; Stuart, 2010).

#### *4.3.5 The End: The Parametric Outcome Analysis*

Having selected the matching algorithm that maximizes balance while keeping *n* as large as possible, the researcher has to move to the usual parametric analysis to obtain a causal effect estimate. Indeed, matching is just a non-parametric statistical tool for reweighting or simply discarding units in the raw data so that the treatment and control groups become as similar as possible on a set of observable covariates or, to put it differently, so that the treatment variable becomes as close as possible to being independent of the background characteristics.

The causal effect can be estimated through a simple (weighted) difference in means between the observed outcomes of the treatment and control groups only if they are exactly balanced. Indeed, the difference in means is equivalent to regressing *Yi* on *Di* without any control variables, thus assuming that *Di* and *Xi* are unrelated. This assumption is plausible only if exact matching has been achieved for the treated units, which is very unlikely. By computing a simple difference in means on a preprocessed sample where there is some remaining imbalance between the treatment and the control groups, we would incur omitted variable bias.
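The equivalence between the difference in means and the bivariate regression of *Yi* on *Di* can be verified numerically. The five observations below are invented, and `ols_slope` is a hand-written helper for the textbook slope formula, not a library call.

```python
from statistics import mean

def ols_slope(d, y):
    """Slope of the OLS regression of y on d (with an intercept): cov(d, y) / var(d)."""
    d_bar, y_bar = mean(d), mean(y)
    cov = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    var = sum((di - d_bar) ** 2 for di in d)
    return cov / var

# Hypothetical matched sample: treatment indicator and observed outcome
d = [1, 1, 1, 0, 0]
y = [5.0, 7.0, 6.0, 3.0, 4.0]

diff_in_means = (mean(yi for yi, di in zip(y, d) if di == 1)
                 - mean(yi for yi, di in zip(y, d) if di == 0))
slope = ols_slope(d, y)
print(diff_in_means, slope)
```

Both numbers coincide, which makes the point of the paragraph concrete: a difference in means controls for nothing, so any remaining imbalance in *Xi* is left unadjusted.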

Thus, whenever the treatment and control groups are not exactly balanced, the researcher is better off using the same parametric model he/she would have also used on the raw data without preprocessing. Preprocessing data with matching makes causal effect estimates based on the subsequent parametric analyses far less dependent on modeling choices and specifications (Ho et al., 2007; Iacus et al., 2019).

#### **4.4 Empirical Illustration**

LaLonde (1986) was the first to assess the performance of several non-experimental estimators by using experimental data as a benchmark. His experimental data came from the National Supported Work Demonstration (NSWD), a subsidized work experience program that took place in 1975–1976 in the United States. The program consisted of providing trainees with work in a sheltered training environment and then assisting them in finding regular jobs. To take part in the NSWD, potential participants had to satisfy a set of eligibility criteria intended to identify individuals with significant barriers to employment. Then, actual treatment (i.e. the subsidized work experience) was randomized among applicants meeting the eligibility criteria.

Using a simple difference in means between the observed post-intervention earnings of the treatment and control groups, LaLonde (1986) obtained an unbiased estimate of the effect of the subsidized work experience: the program was estimated to increase post-intervention earnings by \$1,794 with a 95% confidence interval of [551; 3,038]. Thus, according to this experimental result, the program was successful. Then, he compared this experimental result to those obtained from several nonexperimental estimators applied to the NSWD observations that received training (treated units only) and a set of control observations constructed ex post from two standard population survey data sets (i.e. CPS and PSID). His findings show that alternative non-experimental estimators produce very different estimates, most of which deviate substantially from the experimental benchmark.

Several subsequent studies have reanalyzed LaLonde's results, using more recent statistical procedures (e.g., Dehejia & Wahba, 1999; Becker & Ichino, 2002; Smith & Todd, 2005; Iacus et al., 2009, 2012, 2019). Notably, Dehejia and Wahba (1999) restricted LaLonde's data set to individuals for whom data on previous earnings were available in 1974 and compared several matching estimations to a fully saturated in *X* OLS regression (original samples and replication materials are available on Dehejia's page: https://users.nber.org/~rdehejia/nswdata2.html). They concluded that matching procedures dominated fully saturated in *X* regression. However, Smith and Todd (2005) showed that Dehejia and Wahba's findings depended on the specific sample chosen by the authors and did not hold on other samples. Thus, they argued that estimating the causal effect by simply preprocessing data with matching and then computing a (weighted) difference in means between the treatment and control groups seems not to perform better than a fully saturated in *X* OLS regression. Thus, as explained in Sect. 4.3.5, after having preprocessed data with the matching procedure that maximizes balance while saving enough of *n*, a method of estimation should be applied. Smith and Todd (2005), for example, found that a combination of matching and difference-in-differences performs the best.

This section summarizes and simplifies for the reader the very latest contribution in this long *querelle* about LaLonde's results and matching procedures. Indeed, we focus on the theoretical refinements by Hainmueller (2012) and Iacus et al. (2019) and on the algorithms they, respectively, developed: entropy balancing (EB; Hainmueller & Xu, 2013) and coarsened exact matching (CEM; Blackwell et al., 2009).

EB and CEM are similar in several respects. Both techniques are used in observational studies to preprocess the raw data prior to the estimation of a binary treatment effect under the assumption of selection on observables, and both aim at improving the covariate balance between the treatment and control groups. Moreover, both techniques overcome the propensity score tautology by requiring the researcher to establish the desired degree of covariate balance before the preprocessing adjustment. Lastly, both are computationally efficient and have been shown to reduce model dependence in the subsequent estimation of the treatment effect via parametric outcome analysis.

However, they also differ in important ways. As explained below, CEM coarsens each covariate into substantively meaningful categories identified ex ante by the researcher and then matches units exactly on this coarsened scale. Treated and control units that cannot be exactly matched are discarded. As the reader already knows, by discarding treated units, CEM changes the *estimand* from the ATT to a more local treatment effect for the remaining treated units (see Iacus et al., 2009 for reasons why this can be beneficial). On the other hand, EB leaves the *estimand* unchanged because it does not discard treated units. Sections 4.4.1 and 4.4.2 help readers become familiar with these two algorithms.

#### *4.4.1 Entropy Balancing*

EB is a data preprocessing method proposed by Hainmueller (2012). Crudely put, the algorithm works as follows. As usual, the researcher has to identify a set of pretreatment covariates according to his/her substantive knowledge, previous studies, and data availability. Then, for each covariate, the researcher has to pre-specify a potentially large set of balance constraints to equate the moments of the covariate distribution between the treatment and the control groups. The moments refer to the mean (first moment), the variance (second moment), and the skewness (third moment). For example, the researcher can request that the mean values (first moments) of a set of covariates in the control group exactly equal the mean values of the same set of covariates in the treatment group. Moreover, the researcher can also include interaction terms such that, for example, the mean of one covariate is balanced across subgroups of another covariate. Lastly, the algorithm searches for a set of entropy weights that satisfy the balance constraints imposed by the researcher while remaining as close as possible to the uniformly distributed base weights, so as to prevent loss of information.
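The logic of the weight search can be sketched in a few lines of base R. This is an illustrative toy under stated assumptions, not the *ebalance* implementation: it balances first moments only, uses simulated data with hypothetical names (`x_treated`, `x_control`), and solves the standard dual of the entropy problem with the general-purpose optimizer `optim()`.

```r
# Toy data (simulated, NOT the LaLonde sample): controls are older and more educated.
set.seed(1)
x_treated <- cbind(age = rnorm(100, 25, 5),   educ = rnorm(100, 10, 2))
x_control <- cbind(age = rnorm(1000, 33, 10), educ = rnorm(1000, 12, 3))

target <- colMeans(x_treated)   # first moments the reweighted controls must match

# Center the controls at the treated means; rescale for numerical stability.
z <- scale(x_control, center = target, scale = apply(x_control, 2, sd))

# Dual of the entropy problem: at its minimum, weights proportional to
# exp(-z_i' lambda) give weighted control means equal to the target.
dual <- function(lambda) log(sum(exp(-z %*% lambda)))
dual_gradient <- function(lambda) {
  w <- exp(-z %*% lambda)
  -as.vector(crossprod(z, w / sum(w)))
}

fit <- optim(par = rep(0, ncol(z)), fn = dual, gr = dual_gradient,
             method = "BFGS", control = list(reltol = 1e-12))

w <- as.vector(exp(-z %*% fit$par))
w <- w / sum(w)                 # entropy weights for the control units

rbind(treated          = target,
      control_raw      = colMeans(x_control),
      control_weighted = colSums(w * x_control))
```

Starting the search at `lambda = 0` corresponds to uniform base weights, so the solution departs from them only as far as the balance constraints require. Balancing higher moments would amount to adding columns such as squared or cubed covariates to `x_control` and `x_treated`.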

EB has several attractive features. Its reweighting scheme directly incorporates the researcher's knowledge about the moments in the treatment group and adjusts the weights to balance the covariate distribution exactly in finite samples, without discarding any treated unit. These are key improvements as they overcome the time-consuming search over propensity score models without changing the *estimand*. Moreover, the weights that result from EB can be easily incorporated into any standard statistical model the researcher would have used even without the preprocessing step.

To illustrate the functioning of EB, Hainmueller and Xu (2013) rely on the subset of the original LaLonde data set (1986) already used by Dehejia and Wahba (1999). The data set provides information on 185 treated units from the NSWD that were involved in the subsidized work experience and 15,992 non-participants from the Current Population Survey Social Security Administration File (CPS-1). The former constitutes the treatment group, and the latter the control group. Remember that this control group is not the one identified through randomization during the NSWD. Instead, this control group is built ex post by using the CPS.

The treatment variable, *treat*, is 1 for participants and 0 for nonparticipants. The outcome variable is real earnings in 1978 US dollars (*re78*). The available pretreatment covariates include age (*age*), years of education (*educ*), marital status (*married*), lack of a high school diploma (*nodegree*), race (*black*, *hispan*), indicator variables for unemployment in 1974 (*u74*) and 1975 (*u75*), and real earnings in 1974 (*re74*) and 1975 (*re75*). The *estimand* is the increase in earnings in 1978 due to the subsidized work experience.

By simply regressing *re78* on the treatment variable and all the controls, it seems that being exposed to the subsidized work experience increased earnings in 1978 by \$1,068 (Fig. 4.1). However, the 95% confidence interval is large enough that the estimate is not statistically different from 0. Remember that in this lucky case, we know from the NSWD experimental result that being exposed to the treatment increased earnings in 1978 by \$1,794 with a 95% confidence interval of [551; 3,038]. Thus, the OLS estimate on the raw data is substantially lower than the benchmark effect established on the experimental data.

**Fig. 4.1** OLS regression on the raw data

Thus, the authors preprocess the raw data using EB. The basic syntax of the command *ebalance* requires the researcher to list the treatment variable (*treat*) and the pretreatment covariates he/she will focus on (e.g., *age*, *educ*, *black*, and *hispan*). The most important option in *ebalance* is *targets(numlist)*, as it allows the researcher to impose the balance constraints for the included covariates. In detail, the researcher has to specify a number (1, 2, or 3) that corresponds to the highest covariate moment that should be adjusted for each covariate.

For example, the following command requests that the mean, variance, and skewness of the variables *age*, *educ*, *black*, and *hispan* be adjusted: `ebalance treat age educ black hispan, targets(3)`.

As shown in Fig. 4.2, the command returns the number of treated and control units. Note that EB does not discard treated units (185), thus keeping the original *estimand*. Then, it reports descriptive statistics on the mean, variance, and skewness of the selected covariates in the treatment and in the control groups, before and after the reweighting procedure. As requested, the algorithm perfectly balances the two groups on first-, second-, and third-order moments by fitting the EB weights. By default, the EB weights are stored in a variable named *\_webal* and can be readily used for subsequent analysis.

By writing 2 instead of 3 in parentheses, the algorithm would have balanced only the mean and variance of the same variables; by writing 1, it would have balanced only the mean of the same variables. The command also allows the researcher to specify different constraints for each variable (see Fig. 4.3). For example, the command shown there makes *ebalance* adjust the first moment for *age* and *educ*, the first and the second moments for *black*, and the first, second, and third moments for *hispan*.

**Fig. 4.2** The output of the *ebalance* command

**Fig. 4.3** Options of the *ebalance* command

To reweight the original LaLonde (1986) data set, Hainmueller and Xu (2013) adjust the sample by including the means, variances, and skewness of all of the 10 pretreatment covariates plus squared terms and first-order interactions of the same 10 covariates and cubed terms for *age*, *educ*, *re74*, and *re75*.

By running the initial OLS regression on the reweighted data, the treatment effect estimate suggests that being exposed to the subsidized work experience increased earnings in 1978 by \$1,761 with a 95% confidence interval of [333; 3,190]. Thus, the simple OLS estimate on the reweighted data is very close to the experimental target answer (\$1,794 with a 95% confidence interval of [551; 3,038]). A similar conclusion may be achieved by regressing *re78* on *treat* only (Fig. 4.4).
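Why regressing the outcome on the treatment alone can suffice after reweighting may be seen in a small R sketch. It assumes simulated data and purely hypothetical weights (1 for treated units, arbitrary positive values for controls) standing in for preprocessing weights such as *\_webal*; it is not a reanalysis of the LaLonde data.

```r
# Simulated outcome and hypothetical preprocessing weights (illustrative only).
set.seed(7)
n <- 500
treat <- rbinom(n, 1, 0.3)
y <- 10000 + 2000 * treat + rnorm(n, 0, 3000)
w <- ifelse(treat == 1, 1, runif(n, 0.1, 2))   # stand-in for balancing weights

# Weighted least squares of the outcome on the treatment dummy.
fit <- lm(y ~ treat, weights = w)
wls_effect <- unname(coef(fit)["treat"])

# Weighted difference in means between treated and (reweighted) control units.
weighted_diff <- weighted.mean(y[treat == 1], w[treat == 1]) -
  weighted.mean(y[treat == 0], w[treat == 0])

c(wls = wls_effect, weighted_difference = weighted_diff)
```

The two numbers coincide: with a single binary regressor, weighted least squares reproduces the weighted difference in means. Adding covariates to the weighted regression then adjusts for whatever imbalance remains. Note that `lm()` treats the weights as precision weights when computing standard errors, so a robust variance estimator may be preferable for inference.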

#### *4.4.2 Coarsened Exact Matching*

All the matching procedures based on the propensity score (see Table 4.1) assume that the data generation process is based on simple random sampling, which means that, when drawing repeated hypothetical samples of fixed size *n* < ∞ at random from a population of *θ* units with covariates *X*, each sample of *n* observations has an equal probability of selection.



**Fig. 4.4** OLS regression on the reweighted data

CEM modifies this assumption by theorizing that the data generation process guarantees stratified random sampling. Informally, the adjective "stratified" means that random sampling does not apply directly to the population of *θ* units, but to strata or partitions within this population that are identified by the researcher according to his/her knowledge of the set of covariates *X*. For example, if the set of covariates *X* includes age, gender, and earnings, a stratum may refer to young males making more than \$25,000. Inside this stratum, sample selection should be random (Iacus et al., 2019: 48–49). Then, as with all the other matching procedures, CEM is grounded on the selection on observables and on the common support assumptions (even if inside each stratum; see Iacus et al., 2019: 50–51).

As the reader may have already realized, the emphasis is on the definition of strata by the researcher. The authors underline that this step is case specific and critically reflects "the knowledge the investigator must have" (Iacus et al., 2019: 54). Indeed, the CEM algorithm helps the researcher coarsen each of the pretreatment covariates judged as relevant into substantively meaningful categories that reduce variability while preserving information. The easiest example is the variable reporting the years of education, which can be easily coarsened into categories such as high school, some college, college graduates, etc.

Starting from LaLonde's data set (1986), Iacus et al. (2009, 2011, 2012, 2019) show that CEM, on average, dominates commonly used matching procedures in a large variety of real and simulated data sets because it reduces imbalance, model dependence, estimation error, bias, variance, and mean square error. Moreover, it usually produces more matched units. Furthermore, whereas improving propensity score matching requires the researcher to marginally change and rerun the model, recheck imbalance, and rerun the model again several times (King & Nielsen, 2019), CEM makes it easier to find a specification that improves balance. Indeed, strata are explicitly defined ex ante by the researcher according to his/her substantive knowledge of the covariates: reducing maximum imbalance on one variable never has any effect on the maximum imbalance specified for any of the other variables (Iacus et al., 2012: 21). Let us apply this algorithm to the subset of the original LaLonde data set (1986) already used by Dehejia and Wahba (1999). For an application to LaLonde's original experimental data set, see Blackwell et al. (2009).

First, we have to assess the imbalance in the original unmatched data through the λ<sup>1</sup> statistic (Iacus et al., 2008). This statistic ranges from 0, meaning perfect global balance between the treatment and the control groups, to 1, meaning complete separation between the two (Fig. 4.5).

The *imb* (meaning "imbalance") command works as follows. The researcher has to list the pretreatment covariates they want to focus on (in the example, *age*, *educ*, *black*, and *hispan*), followed by the indication of the treatment variable (*treat*). First, the Stata output shows the λ<sup>1</sup> statistic. In our example, λ<sup>1</sup> = 0.893, thus signaling that the original unmatched data are highly unbalanced. Note that the λ<sup>1</sup> value is not informative on its own: it serves as a point of comparison between matching solutions. The value 0.893 is a baseline reference for the unmatched data. The researcher has to compare the λ<sup>1</sup> value obtained on the matched data to the value 0.893 obtained on the unmatched data and verify whether the matching solution has increased balance (Blackwell et al., 2009: 531).

Then, the output shows additional unidimensional measures of imbalance. The first column, labelled *L1*, reports the λ<sup>1</sup> statistic computed for each variable separately. The second column, *mean*, reports the difference in means between the treatment and control groups. The remaining columns report the difference in the empirical quantiles of the distributions of the two groups for the 0th, 25th, 50th, 75th, and 100th percentiles for each variable (Fig. 4.6).
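The multivariate statistic described above can be illustrated with a short base R sketch. It assumes the usual definition (half the sum of absolute differences between the treated and control relative frequencies across the cells of a cross-tabulation of binned covariates). The function name `l1_imbalance`, the simulated data, and the bin cut-points are hypothetical; the *imb* command chooses its own bins, so values would not match its output.

```r
# Simulated covariates (illustrative only): treated units are younger, less educated.
set.seed(3)
n <- 1000
treat <- rbinom(n, 1, 0.3)
age   <- round(rnorm(n, ifelse(treat == 1, 25, 33), 8))
educ  <- round(rnorm(n, ifelse(treat == 1, 10, 12), 2))

# Half the summed absolute difference between treated and control
# relative frequencies over all occupied cells of the binned covariates.
l1_imbalance <- function(treat, binned, weights = rep(1, length(treat))) {
  cell <- interaction(binned, drop = TRUE)   # one cell per combination of bins
  f <- tapply(weights * (treat == 1), cell, sum) / sum(weights[treat == 1])
  g <- tapply(weights * (treat == 0), cell, sum) / sum(weights[treat == 0])
  0.5 * sum(abs(f - g))
}

# Hypothetical cut-points chosen for the illustration.
binned <- data.frame(
  age  = cut(age,  breaks = c(-Inf, 19, 24, 34, Inf)),
  educ = cut(educ, breaks = c(-Inf, 8, 11, 12, Inf))
)

l1_raw <- l1_imbalance(treat, binned)
l1_raw
```

Because the value depends on the binning, it is only meaningful when the same bins are used before and after matching, which is why the text treats the unmatched value as a baseline rather than as a quantity of interest in itself.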


**Fig. 4.5** The output of the *imb* command

**Fig. 4.6** The output of the *cem* command

Having obtained our baseline reference λ<sup>1</sup> value for the unmatched data, we apply the CEM algorithm by calling the *cem* command. Crudely put, CEM (1) begins with the covariates *X* and makes a copy *X*<sup>∗</sup>, (2) coarsens *X*<sup>∗</sup> according to user-defined cut-points (or CEM's automatic binning algorithm), (3) creates one stratum per unique observation of *X*<sup>∗</sup> and places each observation in a stratum, and (4) assigns these strata to the original data, *X*, and drops any observation whose stratum does not contain at least one treated and one control unit. Note that (4) may drop both treated and control units, thus changing the *estimand*. However, it does so transparently. Obviously, fewer strata will result in more heterogeneous observations within the same stratum and thus higher imbalance, and vice versa (Blackwell et al., 2009: 527).
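Steps (1) to (4) can be mimicked in base R as follows. This is a hedged sketch on simulated data, not the *cem* command: the data frame `dat`, the cut-points, and the column names `stratum` and `matched` are assumptions made for the illustration.

```r
# Simulated data (illustrative only): treated units are all aged 30 or younger.
set.seed(11)
n <- 600
treat <- rbinom(n, 1, 0.25)
dat <- data.frame(
  treat = treat,
  age   = ifelse(treat == 1,
                 sample(17:30, n, replace = TRUE),
                 sample(17:55, n, replace = TRUE)),
  educ  = sample(6:16, n, replace = TRUE)
)

# (1)-(2) Copy the covariates and coarsen the copy with user-defined cut-points.
coarse <- data.frame(
  age  = cut(dat$age,  breaks = c(-Inf, 19, 24, 34, Inf)),
  educ = cut(dat$educ, breaks = c(-Inf, 8, 11, 12, Inf))
)

# (3) One stratum per unique combination of coarsened values.
dat$stratum <- interaction(coarse, drop = TRUE)

# (4) Keep only strata with at least one treated and one control unit.
n_treated <- tapply(dat$treat == 1, dat$stratum, sum)
n_control <- tapply(dat$treat == 0, dat$stratum, sum)
keep <- names(n_treated)[n_treated > 0 & n_control > 0]
dat$matched <- dat$stratum %in% keep

# Matched and unmatched units by treatment group; the original (uncoarsened)
# covariates remain available in `dat` for the subsequent analysis.
table(treat = dat$treat, matched = dat$matched)
```

In this toy example the strata for the oldest age class contain no treated units, so their controls are dropped. Coarser cut-points would retain more units at the price of more heterogeneous strata.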

According to this basic coding, *cem* performs an automated coarsening. The output provides a small table reporting the number of observations in total (*All*), matched and unmatched by treatment group. Notably, two treated observations have been discarded because there were no good matches (thus, the *estimand* is changed).

Then, the output provides information about the imbalance in the matched data. The imbalance in the preprocessed data set is equal to 0.343, which means that the common ground between treated and control units is equal to 66%. Since our baseline reference λ<sup>1</sup> value for the unmatched data is 0.893, this matching solution increases the balance between the two groups. Note that *cem* also generates weights (stored in *cem\_weights*) for use in the subsequent analysis (Fig. 4.7).

As anticipated, the added value of *cem* is that it allows the researcher to set the coarsening for each variable such that substantively indistinguishable values are grouped together. For example, the code in Fig. 4.7 asks *cem* to match all binary variables and education exactly and *age* according to standard labor force classes (i.e., 15–19, 20–24, 25–34, 35 and over).

**Fig. 4.7** The output of the *cem* command with specific coarsening

**Fig. 4.8** OLS regression with *cem* weights

This matching solution differs from that resulting from the automated approach: the balance is worse (from 0.343 in the automated preprocessed data set to 0.431 in the data set preprocessed according to user choices), but all the treated units have been matched. Since we have not achieved a perfect balance between treatment and control groups, it is a good idea to adjust for the remaining imbalance via a statistical model. This can be done by taking advantage of the *cem\_weights* (Fig. 4.8).
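How stratum-based weights enter the outcome model can be sketched in base R. The weighting rule used here is the one commonly documented for CEM (weight 1 for matched treated units, 0 for unmatched units, and for a matched control in stratum *s* the ratio of matched controls to matched treated overall, multiplied by the ratio of treated to controls in *s*); treat it as an assumption of this sketch rather than a description of the Stata output. Data and names are simulated and hypothetical.

```r
# Simulated data (illustrative only): one stratum per age class.
set.seed(21)
n <- 800
age_class <- sample(c("15-19", "20-24", "25-34", "35+"), n, replace = TRUE)
p_treat <- c("15-19" = 0.5, "20-24" = 0.4, "25-34" = 0.2, "35+" = 0.1)
treat <- rbinom(n, 1, p_treat[age_class])
earnings <- 4000 + 1500 * treat + 1000 * (age_class == "35+") + rnorm(n, 0, 2000)

# Treated and control counts per stratum, and matched totals.
m_t_s <- tapply(treat == 1, age_class, sum)
m_c_s <- tapply(treat == 0, age_class, sum)
matched <- as.vector((m_t_s > 0 & m_c_s > 0)[age_class])
m_t <- sum(treat == 1 & matched)
m_c <- sum(treat == 0 & matched)

# Assumed weighting rule: controls are rescaled so that their distribution
# across strata mirrors that of the treated units.
ratio <- as.vector(m_t_s[age_class] / m_c_s[age_class])
w <- ifelse(!matched, 0, ifelse(treat == 1, 1, (m_c / m_t) * ratio))

# Weighted regression of the outcome on the treatment.
fit <- lm(earnings ~ treat, weights = w)
coef(fit)["treat"]
```

Covariates can be added to the weighted regression to adjust for the imbalance that remains within strata, which is the step the text recommends when balance is not perfect.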

By running the initial OLS regression on the reweighted data, the treatment effect estimate suggests that being exposed to the subsidized work experience increased earnings in 1978 by \$1,499 with a 95% confidence interval of [571; 2,428]. Thus, the OLS estimate on the *cem* reweighted data is quite close to the experimental target answer (\$1,794 with a 95% confidence interval of [551; 3,038]).

#### **4.5 Conclusion**

This chapter discussed the necessary assumptions for statistical correlation to justify a causal interpretation when, as is usually the case in practice, controlled randomization is unfeasible or politically sensitive and there are no convincing natural experiments providing a substitute for randomization.

First, the chapter recognized that in observational studies, causal inference is always hazardous due to the strong assumption of selection on observables, which is not easily testable by looking at the raw data (see Oster, 2019 on evaluating OLS robustness to the omitted variable bias). The chapter clarified that, ultimately, the reliability of the estimates obtained by preprocessing the raw data depends on the validity of the selection on observables assumption, which should be discussed on a case-by-case basis by the researcher. Simply put, once you have identified a set of covariates *Xi*, you should ask yourself whether there are additional unobservable variables capable of pushing units into treatment. If the answer is "No," then the assumption of selection on observables is theoretically met, and matching and weighting procedures may credibly help you in finding out causal relationships.

Second, the chapter endorsed the practice of preprocessing the raw data through weighting and matching techniques in order to generate well-balanced samples and then applying the same familiar methods of estimation the researcher would have used anyway on the original data set, without preprocessing. In fact, even if these implementation steps do not overcome the selection on observables assumption (i.e., even if your answer to the previous question is "Yes"), weighting and matching techniques will reduce model dependence for the subsequent estimation of the treatment effect via parametric analysis. This means that effect estimates become far less sensitive to seemingly arbitrary choices in model specification: if the treatment and control groups are well balanced, slightly different model specifications are less likely to alter the substantive empirical conclusion of the analysis. Thus, preprocessing the raw data through weighting and matching techniques to generate well-balanced samples is strongly suggested. In this regard, remember that CEM may discard treated units, while EB leaves the *estimand* unchanged. Even if dropping unmatched treated units can be beneficial (Iacus et al., 2009), this choice should also be openly discussed on a case-by-case basis by the researcher: for example, dropping a treated respondent in a survey may be easier to justify than dropping an entire geographical region.

The hands-on section provided practical guidance for the implementation of the EB and CEM algorithms, respectively. This exercise was performed on the well-known LaLonde (1986) data set, a lucky case in which we know the "true" average treatment effect from an RCT and we have to match or weight the observations and to adjust the model specification so that the estimate becomes as close as possible to the experimental result (see also Costalli & Negri, 2021 for the application of CEM to the evaluation of the effectiveness of peacekeeping missions in the Bosnian civil war).

This is not what usually happens in practice. Since researchers do not know the "true" average treatment effect, they face several decisions during the implementation of the statistical analysis, and there are not always rules of thumb to be applied. The most desirable feature of the implementation steps suggested here is that they force researchers to take the assumptions that have to be met out of the shadows and make them explicit before looking at the outcomes.

Several things may go wrong. For example, researchers may miss a higher dimensional aspect of imbalance when checking lower dimensional summaries. This may affect the estimates. However, since this may also happen without preprocessing, following the steps suggested here should at least not make things worse. Moreover, when the preprocessing implies the loss of some treated unit, researchers should openly discuss the consequences in terms of external validity.

Lastly, as with the techniques covered in Chaps. 3 and 5, the research designs discussed here are suitable for establishing a causal relationship between a given variable of interest, the treatment, and an outcome variable, while controlling for confounders. The implementation steps described here are not designed to investigate the paths linking a factor of interest to the outcome (see Chap. 6), to identify the full set of conditions under which the positive outcome is observed (see Chap. 7), or to uncover the mechanisms (see Chap. 8) behind the estimated effects. While recognizing these limitations, these implementation steps help researchers evaluate whether they are meeting the necessary conditions for generating valid inferences in their applications, or how far they are from doing so. Good luck with your applied research.

#### **Review Questions**


through a simple difference in means between the observed outcomes of the treatment and control groups?

	- Confirmation bias
	- Selection on observables
	- Model dependence
	- Common support
	- Propensity score
	- Balance

#### **Replication Material**

• Data and replication materials for Section 4.4 are available at https://github.com/FedraNegri/CorrelationIsNotCausationYet-.git

#### **References**


Goldberger, A. (1991). *A course in econometrics*. Harvard University Press.


#### *Suggested Readings*


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 5 Getting the Most Out of Surveys: Multilevel Regression and Poststratification**

**Joseph T. Ornstein**

**Abstract** Good causal inference requires good measurement; even the most thoughtfully designed research can be derailed by noisy data. Because policy scholars are often interested in public opinion as a key dependent or independent variable, paying careful attention to the sources of measurement error from surveys is an essential step toward detecting causation. This chapter introduces multilevel regression and poststratification (MRP), a method for adjusting public opinion estimates to account for observed imbalances between the survey sample and population of interest. It covers the history of MRP, recent advances, an example analysis with code, and concludes with a discussion of best practices and limitations of the approach.

#### **Learning Objectives**

By the end of this chapter, you will be able to:


J. T. Ornstein (\*)

Department of Political Science, University of Georgia, Athens, GA, USA e-mail: jornstein@uga.edu

#### **5.1 Introduction**

The book you are reading is a testament to the "credibility revolution" in the social sciences (Angrist & Pischke, 2010), a wide-ranging effort spanning multiple disciplines to develop credible, design-based approaches to causal inference. It is difficult to overstate the influence this revolution has had on empirical social science, and the increasing emphasis that policymakers place on informing policy with good research design is a welcome trend.

But as the ongoing replication crisis in experimental psychology (Button et al., 2013) has made clear, good research design alone is insufficient to yield good science. After all, double-blind randomized control trials are the "gold standard" of credible causal inference, but small sample sizes and noisy measurement have created a situation where many published effect estimates fail to replicate upon further scrutiny (Loken & Gelman, 2017). To confidently detect causation, one needs both good research design *and* good measurement.

Often policy researchers are interested in public opinion on some issue, either as an independent or dependent variable. But the surveys we use to measure public opinion are frequently unrepresentative in some important way. Perhaps their respondents come from a convenience sample (Wang et al., 2015), or non-response bias skews an otherwise random sample. Or perhaps the data is representative of some larger population (i.e., a country-level random sample) but contains too few observations to make inferences about a subgroup of interest. Even the largest US public opinion surveys do not have enough respondents to make reliable inferences about lower-level political entities like states or municipalities. Conclusions drawn from low frequency observations – even in a large sample survey – can be wildly misleading (Ansolabehere et al., 2015).

This presents a challenge for researchers: how to take unrepresentative survey data and adjust it so that it is useful for our particular research question. In this chapter, I will demonstrate a method called *Multilevel Regression and Poststratification* (MRP). Using this approach, the researcher first constructs a model of public opinion (multilevel regression) and then reweights the model's predictions based on the observed characteristics of the population of interest (poststratification). In the sections that follow, I will describe this approach in detail, accompanied by replication code in the R statistical language.

As we will see, the accuracy of our MRP estimates depends critically on whether the first-stage model makes good out-of-sample predictions. The best first-stage models are *regularized* (Gelman, 2018) to avoid both over- and underfitting to the survey data. Regularized ensemble models (Ornstein, 2020) with group-level predictors tend to produce the best estimates, especially when trained on large survey datasets.

#### **5.2 How It Works**

MRP was first introduced by Gelman and Little (1997), and in the subsequent decades, it has helped address a diverse set of research questions in political science. These range from generating election forecasts using unrepresentative survey data (Wang et al., 2015) to assessing the responsiveness of state (Lax & Phillips, 2012) and local policymakers (Tausanovitch & Warshaw, 2014) to their constituents' policy preferences.

To demonstrate how the method works, the next section will introduce a running example drawn from the Cooperative Election Study (Schaffner et al., 2021), a 50,000+ respondent study of voters in the United States. The 2020 wave of the study includes a question asking respondents whether they support a policy that would "decrease the number of police on the street by 10 percent, and increase funding for other public services." Since police reform is a policy issue on which US local governments have a significant amount of autonomy, it would be useful to know how opinions on this issue vary from place to place without having to conduct separate, costly surveys in each area.

The problem is that even a survey as large as CES has relatively few respondents in some small areas of interest. If we wanted to know, for example, what voters in Detroit thought about police reform, a survey of 50,000 people randomly sampled from across the United States will have, on average, only 100 people from Detroit. Estimates from such a small sample will not be very precise. And more importantly, those 100 people are unlikely to be representative of the population of Detroit, since the survey was designed to be representative of the country at large.
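A back-of-the-envelope calculation in R shows how imprecise such an estimate would be. It takes the 100 respondents mentioned above at face value and assumes, purely for illustration, a true level of support of 40% and simple random sampling within the city; neither assumption comes from the CES.

```r
n_detroit <- 100    # respondents, as in the text
p_assumed <- 0.40   # hypothetical true share supporting the policy

# Standard error of a sample proportion and the half-width of a 95% interval.
se     <- sqrt(p_assumed * (1 - p_assumed) / n_detroit)
margin <- qnorm(0.975) * se

round(c(standard_error = se, margin_of_error = margin), 3)
```

Under these assumptions the margin of error is close to ten percentage points, before even accounting for the fact that the subsample is not designed to be representative of the city.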

The core insight of the MRP approach is that we can use similar respondents from similar areas – e.g., Cleveland or Chicago or Pittsburgh – to improve our inferences about public opinion in Detroit. The way we do so is to first fit a statistical model of public opinion, using both individual-level predictors (e.g., race, age, gender, education) and group-level predictors (e.g., median income, population density) from our survey dataset. Then, we reweight the predictions of the model to match the observed demographics and characteristics of Detroit. In this way, we get the most out of the information contained in our survey and produce a better estimate of what Detroit residents think than our small sample from Detroit alone could produce.

#### **5.3 Running Example**

To help demonstrate this process, we will draw a small random sample from the CES survey, and, using that sample alone, attempt to estimate public opinion on police reform in each US state. In this way, we can evaluate the accuracy of our MRP estimates and explore how various refinements to the method improve predictive accuracy. This approach mirrors Buttice and Highton (2013), who use disaggregated responses from a large-scale US survey of voters as their target estimand to evaluate MRP's performance. The Cooperative Election Study data is publicly available, and we'll be using a tidied version of the dataset created by the R/cleanup-ces-2020.R script.<sup>1</sup>

```
library(tidyverse)
library(ggrepel)
load('data/CES-2020.RData')
```
This tidied version of the data only includes the 33 states with at least 500 respondents. First, let's plot the percent of CES respondents who supported "defunding" the police<sup>2</sup> by state.

```
truth <- ces %>%
  group_by(abb) %>%
  summarize(truth = mean(defund_police))
truth %>%
  mutate(abb = fct_reorder(abb, truth)) %>%
  ggplot(mapping = aes(x=truth, y=abb)) +
  geom_point(alpha = 0.7) +
  labs(x = 'Percent Who Support Police Reform Policy',
       y = 'State') +
  theme_minimal()
```
Oregon is the only state where a majority of respondents supported this policy proposal. And note that Fig. 5.1 likely *overstates* the percent of the total population that support such a policy, since self-identified Democrats are overrepresented in the CES sample. But nevertheless, these population-level parameters will be a useful target to evaluate the performance of our MRP estimates.

<sup>1</sup>All replication code and data is available on a public repository (https://github.com/joeornstein/mrp-chapter). Throughout, I will use R functions from the "tidyverse" (Wickham et al., 2019) to make the code more human readable.

<sup>2</sup>Obviously that phrase means different things to different people. In this case, we'll stick with the CES proposed policy of reducing police staffing by 10% and diverting those expenditures to other priorities.

**Fig. 5.1** The percent of CES respondents in each state who support reducing police budgets. These are our target estimands

#### *5.3.1 Draw a Sample*

Suppose that we did not have access to the entire CES dataset, but only to a random sample of 1,000 respondents. How good of a job can we do at estimating those state-level means?

```
sample_data <- ces %>%
 slice_sample(n = 1000)
sample_summary <- sample_data %>%
 group_by(abb) %>%
 summarize(estimate = mean(defund_police),
         num = n())
sample_summary
## # A tibble: 33 x 3
## abb estimate num
## <chr> <dbl> <int>
## 1 AL 0.55 20
## 2 AR 0 4
## 3 AZ 0.438 16
## 4 CA 0.435 85
## 5 CO 0.478 23
## 6 CT 0.375 8
## 7 FL 0.402 87
## 8 GA 0.346 26
## 9 IA 0.308 13
## 10 IL 0.28 50
## # ... with 23 more rows
```
In a sample with only 1,000 respondents, there are several states with very few (or no) respondents. Notice, for example, that this sample includes only four respondents from Arkansas, of whom zero support reducing police budgets. Simply disaggregating and taking sample means is unlikely to yield good estimates, as you can see by comparing those sample means against the truth (Fig. 5.2).

**Fig. 5.2** Estimates from disaggregated sample data

```
# a function to plot the state-level estimates against the truth
library(ggrepel) # provides geom_text_repel()
compare_to_truth <- function(estimates, truth){
  # merge the estimates with the true state-level values
  d <- left_join(estimates, truth, by = 'abb')
  ggplot(data = d,
         mapping = aes(x=estimate,
                       y=truth,
                       label=abb)) +
    geom_point(alpha = 0.5) +
    geom_text_repel() +
    theme_minimal() +
    geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
    labs(x = 'Estimate',
         y = 'Truth',
         caption = paste0('Correlation = ', round(cor(d$estimate, d$truth), 2),
                          ', Mean Absolute Error = ',
                          round(mean(abs(d$estimate - d$truth)), 3)))
}
compare_to_truth(sample_summary, truth)
```
These are clearly poor estimates of state-level public opinion. The four respondents from Arkansas simply do not give us enough information to adequately measure public opinion in that state. But one of the key insights behind MRP is that the respondents from Arkansas are not the only respondents who can give us information about Arkansas! There are other respondents in, for example, Missouri, that are similar to Arkansas residents on their observed characteristics. If we can determine the characteristics that predict support for police reform using the entire survey sample, then we can use those predictions – combined with demographic information about Arkansans – to generate better estimates. The trick, in essence, is that our estimate for Arkansas will be borrowing information from similar respondents in other states.

The method proceeds in three steps.

#### **5.3.1.1 Step 1: Fit a Model**

First, we fit a model of our outcome, using observed characteristics of the survey respondents as predictors. To demonstrate, let's fit a simple logistic regression model including only four demographic predictors: gender, education, race, and age.

```
model <- glm(defund_police ~
              gender + educ + race + age,
            data = sample_data,
            family = 'binomial')
```
#### **5.3.1.2 Step 2: Construct the Poststratification Frame**

The poststratification stage requires the researcher to know (or estimate) the joint frequency distribution of predictor variables in each state. This information is stored in a "poststratification frame," a matrix where each row is a unique combination of characteristics, along with the observed frequency of that combination. Often, one constructs this frequency distribution from Census micro-data (Lax & Phillips, 2009). For our demonstration, I will compute it directly from the CES.

```
psframe <- ces %>%
 count(abb, gender, educ, race, age)
head(psframe)
## # A tibble: 6 x 6
## abb gender educ race age n
## <chr> <chr> <chr> <chr> <dbl> <int>
## 1 AL Female 2_year Black 26 1
## 2 AL Female 2_year Black 27 2
## 3 AL Female 2_year Black 29 1
## 4 AL Female 2_year Black 31 1
## 5 AL Female 2_year Black 34 2
## 6 AL Female 2_year Black 35 2
```
#### **5.3.1.3 Step 3: Predict and Poststratify**

With the model and poststratification frame in hand, the final step is to generate frequency-weighted predictions of public opinion. For each cell in the poststratification frame, append the model's predicted probability of supporting police defunding.

`psframe$predicted_probability <- predict(model, psframe, type = 'response')`

Then, the poststratified estimates are the frequency-weighted means of those predictions.

```
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
```
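The arithmetic of this step can be checked by hand on a toy frame. The sketch below uses two hypothetical states (`AA` and `BB`), two demographic cells, and made-up predicted probabilities; none of these values come from the CES.

```r
# Hypothetical two-state frame: each row is a demographic cell with its
# population count n and a (made-up) predicted probability of support.
toy_frame <- data.frame(
  abb  = c("AA", "AA", "BB", "BB"),
  cell = c("young", "old", "young", "old"),
  n    = c(300, 100, 100, 300),
  predicted_probability = c(0.6, 0.2, 0.6, 0.2)
)

# frequency-weighted mean of the predictions within each state
toy_estimates <- sapply(split(toy_frame, toy_frame$abb), function(d) {
  weighted.mean(d$predicted_probability, d$n)
})
toy_estimates
```

Both states share the same cell-level predictions, yet the estimates differ (0.5 for `AA`, 0.3 for `BB`) because the cells make up different shares of each state's population. That compositional difference is all the poststratification step contributes.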
Let's see how these estimates compare with the known values (Fig. 5.3).

`compare_to_truth(poststratified_estimates, truth)`

These estimates, though still imperfectly correlated with the truth, are much better than the previous estimates from disaggregation. Notice, in particular, that the estimate for Arkansas went from 0% to roughly 39%, reflecting the significant improvement that comes from using more information than the four Arkansans in our sample can provide.

**Fig. 5.3** Underfit MRP estimates from complete pooling model

But we can still do better. In the following sections, I will show how successive improvements to the first-stage model can yield more reliable poststratified estimates.

#### *5.3.2 Beware Overfitting*

A common instinct among social scientists building models is to take a "kitchen sink" approach, including as many explanatory variables as possible (Achen, 2005). This is counterproductive when the objective is out-of-sample predictive accuracy. To illustrate, let's estimate a model with a separate intercept term for each state – a "fixed effects" model. Because our sample contains several states with very few observations, these state-specific intercepts will be overfit to sampling variability (Fig. 5.4).

**Fig. 5.4** Overfit MRP estimates from fixed effects model

```
# fit the model
model2 <- glm(defund_police ~
              gender + educ + race + age +
                abb,
            data = sample_data,
            family = 'binomial')
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, educ, race, age)
# make predictions
psframe$predicted_probability <- predict(model2, psframe, type = 'response')
# poststratify
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
compare_to_truth(poststratified_estimates, truth)
```

These poststratified estimates perform about as well as the disaggregated estimates from Fig. 5.2. Because each state's intercept is estimated separately, the overfit model forgoes the advantages of "partial pooling" (Park et al., 2004), that is, borrowing information from respondents in other states. Note that the estimate for Arkansas is once again 0%.

#### *5.3.3 Partial Pooling*

A better approach is to estimate a multilevel model (alternatively known as a "varying intercepts" or "random effects" model), including group-level covariates. In the model below, I estimate varying intercepts by US Census division, including the state's 2020 Democratic vote share as a covariate. The result is a marked improvement over Fig. 5.3, particularly for West Coast states like Oregon, Washington, and California (Fig. 5.5).

**Fig. 5.5** MRP estimates from model with partial pooling

```
library(lme4)
# fit the model
model3 <- glmer(defund_police ~ gender + educ + race + age +
                (1 + biden_vote_share | division), 
                 data = sample_data,
                 family = 'binomial')
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, educ, race, age, division, biden_vote_share)
# make predictions
psframe$predicted_probability <- predict(model3, psframe, type = 'response')
# poststratify
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
compare_to_truth(poststratified_estimates, truth)
```
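The intuition behind partial pooling can be sketched with the textbook precision-weighting formula for a varying-intercept model under a normal approximation. This is a simplification, not what `glmer` computes internally for a logistic model, and every number below (group means, sample sizes, variances) is hypothetical.

```r
# Partial pooling for a varying-intercept model with known variances
# (normal approximation; all numbers below are hypothetical).
group_mean <- c(0.00, 0.44, 0.40)   # raw means in three groups
group_n    <- c(4, 85, 1000)        # respondents per group
grand_mean <- 0.40                  # overall mean
sigma2_y     <- 0.24                # assumed within-group variance
sigma2_alpha <- 0.01                # assumed between-group variance

# precision weight placed on each group's own data
w <- (group_n / sigma2_y) / (group_n / sigma2_y + 1 / sigma2_alpha)

# partially pooled estimate: a compromise between group and grand mean
pooled <- w * group_mean + (1 - w) * grand_mean
round(cbind(group_n, group_mean, weight = w, pooled), 3)
```

Under these assumptions, the four-respondent group is pulled most of the way toward the grand mean, while the group with 1,000 respondents keeps an estimate close to its own raw mean. Complete pooling and the fixed effects model correspond to the two extremes, with weights of 0 and 1 respectively.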
#### *5.3.4 Sample Size Is Critical*

MRP's performance depends heavily on the quality and size of the researcher's survey sample. Up to now, we've been working with a random sample of 1,000 respondents, and though the resulting estimates are better than the raw sample means, their performance has been somewhat underwhelming. Suppose instead we had a sample of 5,000 respondents (Fig. 5.6).

```
sample_data <- ces %>%
  slice_sample(n = 5000)
# fit the model
model3 <- glmer(defund_police ~ gender + educ + race + age +
                (1 + biden_vote_share | division), 
                 data = sample_data,
                 family = 'binomial')
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, educ, race, age, division, biden_vote_share)
# make predictions
psframe$predicted_probability <- predict(model3, psframe, type = 'response')
# poststratify
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
compare_to_truth(poststratified_estimates, truth)
```
**Fig. 5.6** Poststratified estimates with a survey sample of 5,000

Now MRP really shines. With more observations, the first-stage model can better predict opinions of out-of-sample respondents, which dramatically improves the poststratified estimates.

#### *5.3.5 Stacked Regression and Poststratification (SRP)*

Ultimately, the accuracy of one's poststratified estimates depends on the out-of-sample predictive performance of the first-stage model. As we've seen above, the challenge is to thread the needle between overfitting and underfitting. Several recent papers (Bisbee, 2019; Broniecki et al., 2022; Ornstein, 2020) have shown that approaches from machine learning can help to automate this process, particularly with large survey samples.

In the code below, I'll demonstrate how an *ensemble* of models – using the same set of predictors but different methods for combining them into predictions – can yield superior performance to a single multilevel regression model. In particular, I will fit a "stacked regression" (Breiman, 1996), which makes predictions based on a weighted average of multiple models, where the weights are assigned by cross-validated prediction performance (van der Laan et al., 2007). The literature on ensemble models is extensive, but for good entry points, I recommend Breiman (1996), Breiman (2001), and Montgomery et al. (2012) (Fig. 5.7).
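The idea of weighting models by predictive performance can be illustrated in a few lines of base R. The sketch below is a deliberately simplified stand-in, not the algorithm that the `SuperLearner` package runs: the data are simulated, the two "models" are hypothetical prediction vectors, and the weight is found by a simple grid search.

```r
set.seed(42)
# Simulated data (hypothetical): true probabilities and binary outcomes
n <- 500
x <- runif(n)
p_true <- plogis(-1 + 2 * x)
y <- rbinom(n, 1, p_true)

# Stand-ins for out-of-sample predictions from two candidate models
pred_a <- rep(mean(y), n)                 # underfit: ignores x
pred_b <- plogis(-1 + 2 * x + rnorm(n))   # noisy: behaves like an overfit model

# Search convex weights for the lowest mean squared error
weights <- seq(0, 1, by = 0.01)
mse <- sapply(weights, function(w) {
  mean((y - (w * pred_a + (1 - w) * pred_b))^2)
})
best_w <- weights[which.min(mse)]

# The stacked prediction is the weighted average of the two models
stacked <- best_w * pred_a + (1 - best_w) * pred_b
c(weight_on_a = best_w,
  mse_b_only = mse[1],
  mse_a_only = mse[length(mse)],
  mse_stacked = min(mse))
```

Because the grid includes the weights 0 and 1, the stacked prediction can never do worse on this criterion than either candidate alone. In a real application the candidate predictions would come from cross-validation folds, so that the weights reward out-of-sample rather than in-sample fit.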

**Fig. 5.7** Estimates from an ensemble first-stage model

```
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, educ, race, age, division, biden_vote_share)
# fit the model (an ensemble of random forest and logistic regression)
library(SuperLearner)
SL.library <- c("SL.ranger", "SL.glm")
X <- sample_data %>%
  select(gender, educ, race, age, division, biden_vote_share)
newX <- psframe %>%
  select(gender, educ, race, age, division, biden_vote_share)
sl <- SuperLearner(Y = sample_data$defund_police,
                       X = X,
                       newX = newX, 
                       family = binomial(),
                       SL.library = SL.library, verbose = FALSE)
# make predictions
psframe$predicted_probability <- sl$SL.predict
# poststratify
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
compare_to_truth(poststratified_estimates, truth)
```
The performance gains in Fig. 5.7 reflect the improvement that comes from modeling "deep interactions" in the predictors of public opinion (Ghitza & Gelman, 2013). If, for example, income better predicts partisanship in some states but not in others (Gelman et al., 2007), then a model that captures that moderating effect will produce better poststratified estimates than one that does not. Machine learning techniques like random forest (Breiman, 2001) are especially useful for automatically detecting and representing such deep interactions, and stacked regression and poststratification (SRP) tends to outperform MRP in simulations, particularly for training data with large sample size (Ornstein, 2020).

#### *5.3.6 Synthetic Poststratification*

Researchers rarely have access to the entire joint distribution of individual-level covariates. This can be limiting, since there may be a variable that one would like to include in the first-stage model but cannot because it is not in the poststratification frame. Leemann and Wasserfallen (2017) suggest an extension of MRP, which they (delightfully) dub 'Multilevel regression and synthetic Poststratification' (MrsP). Lacking the full joint distribution of covariates for poststratification, one can instead create a *synthetic* poststratification frame by assuming that additional covariates are statistically independent of one another. So long as the first-stage model is linear additive, this approach yields the same predictions as if you knew the true joint distribution!<sup>3</sup> And even if the first-stage model is not linear additive, simulations suggest that the improved performance from additional predictors tends to overcome the error introduced in the poststratification stage.
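The linear-additive result is easy to verify numerically. The sketch below builds a hypothetical state with two correlated binary covariates (`a` and `b`), then compares poststratified estimates under the true joint distribution and under a synthetic frame built from the marginals. The coefficients and probabilities are made up for illustration.

```r
# Hypothetical state with two binary covariates that are correlated.
joint <- data.frame(
  a = c(0, 0, 1, 1),
  b = c(0, 1, 0, 1),
  prob = c(0.4, 0.1, 0.1, 0.4)   # true joint distribution
)

# Synthetic frame: product of the marginals (assumes independence)
p_a <- tapply(joint$prob, joint$a, sum)   # marginal distribution of a
p_b <- tapply(joint$prob, joint$b, sum)   # marginal distribution of b
synthetic <- joint
synthetic$prob <- as.numeric(p_a[as.character(joint$a)]) *
  as.numeric(p_b[as.character(joint$b)])

# A linear additive prediction rule (made-up coefficients)
linear_pred <- function(d) 0.2 + 0.3 * d$a + 0.1 * d$b
# A non-additive rule with an interaction, for contrast
interact_pred <- function(d) 0.2 + 0.3 * d$a * d$b

# poststratified estimate: probability-weighted mean of the predictions
est <- function(f, frame) weighted.mean(f(frame), frame$prob)
c(linear_true        = est(linear_pred, joint),
  linear_synthetic   = est(linear_pred, synthetic),
  interact_true      = est(interact_pred, joint),
  interact_synthetic = est(interact_pred, synthetic))
```

With the additive rule, both frames give 0.4, even though the covariates are far from independent. With the interaction, the true frame gives 0.32 and the synthetic frame 0.275, which illustrates the error that the independence assumption can introduce when the first-stage model is not linear additive.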

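Here are some CES covariates that we might want to include in our model of police reform: the importance of religion (`pew_religimp`), homeownership (`homeowner`), urban residence (`urban`), parental status (`parent`), and military household status (`military_household`).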


These variables are likely to be useful predictors of opinion about police reform, and the first-stage model could be improved by including them. But there is no dataset (that I know of) that would allow us to compute a state-level joint probability distribution over every one of them. Instead, we would typically only know the marginal distributions of each covariate (e.g., the percent of a state's residents that are military households or the percent that live in urban areas). So a synthetic poststratification approach may prove helpful.

To create a synthetic poststratification frame, we create a set of marginal probability distributions and multiply them together.<sup>4</sup>

<sup>3</sup>See Ornstein (2020) Appendix A for mathematical proof.

<sup>4</sup>The SRP package contains a convenience function for this operation (see the vignette for more information).

```
# fit the model
model4 <- glmer(defund_police ~ gender + educ + race + age +
                  pew_religimp + homeowner + urban +
                  parent + military_household +
                  (1 + biden_vote_share | division),
                data = sample_data,
                family = 'binomial')
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, educ, race, age,
        division, biden_vote_share) %>%
  # convert frequencies to probabilities
  group_by(abb) %>%
  mutate(prob = n/sum(n))
# find the marginal distribution for each new variable
marginal_pew_religimp <- ces %>%
  count(abb, pew_religimp) %>%
  group_by(abb) %>%
  mutate(marginal_pew_religimp = n/sum(n))
marginal_urban <- ces %>%
  count(abb, urban) %>%
  group_by(abb) %>%
  mutate(marginal_urban = n/sum(n))
marginal_parent <- ces %>%
  count(abb, parent) %>%
  group_by(abb) %>%
  mutate(marginal_parent = n/sum(n))
marginal_military_household <- ces %>%
  count(abb, military_household) %>%
  group_by(abb) %>%
  mutate(marginal_military_household = n/sum(n))
marginal_homeowner <- ces %>%
  count(abb, homeowner) %>%
  group_by(abb) %>%
  mutate(marginal_homeowner = n/sum(n))
```

```
# merge the marginal distributions together
synthetic_psframe <- psframe %>%
  left_join(marginal_pew_religimp, by = 'abb') %>%
  left_join(marginal_homeowner, by = 'abb') %>%
  left_join(marginal_urban, by = 'abb') %>%
  left_join(marginal_parent, by = 'abb') %>%
  left_join(marginal_military_household, by = 'abb') %>%
  # and multiply
  mutate(prob = prob * marginal_pew_religimp *
           marginal_homeowner * marginal_urban *
           marginal_parent * marginal_military_household)
```
Then, poststratify as normal using the synthetic poststratification frame (Fig. 5.8).

**Fig. 5.8** Estimates from synthetic poststratification, including additional covariates

```
# make predictions
synthetic_psframe$predicted_probability <- predict(model4, synthetic_psframe,
                                                    type = 'response')
# poststratify
poststratified_estimates <- synthetic_psframe %>%
  group_by(abb) %>%
  # (note that we're weighting by prob instead of n here)
  summarize(estimate = weighted.mean(predicted_probability, prob))
compare_to_truth(poststratified_estimates, truth)
```
#### *5.3.7 Best Performing*

As a final demonstration, suppose we had access to the entire joint distribution over those covariates, *and* our first-stage model was a Super Learner ensemble. This combination yields the best-performing estimates yet (Fig. 5.9).

**Fig. 5.9** The best performing estimates, using a large survey sample, ensemble first-stage model, and full set of predictors

```
# construct the poststratification frame
psframe <- ces %>%
  count(abb, gender, race, age, educ,
        division, biden_vote_share,
        pew_religimp, homeowner, urban,
        parent, military_household)
# fit Super Learner
SL.library <- c("SL.ranger", "SL.glm")
X <- sample_data %>%
  select(gender, race, age, educ,
        division, biden_vote_share,
        pew_religimp, homeowner, urban,
        parent, military_household)
newX <- psframe %>%
  select(gender, race, age, educ,
        division, biden_vote_share,
        pew_religimp, homeowner, urban,
        parent, military_household)
sl <- SuperLearner(Y = sample_data$defund_police,
                       X = X,
                       newX = newX, 
                       family = binomial(),
                       SL.library = SL.library, 
                       verbose = FALSE)
# make predictions
psframe$predicted_probability <- sl$SL.predict
# poststratify
poststratified_estimates <- psframe %>%
  group_by(abb) %>%
  summarize(estimate = weighted.mean(predicted_probability, n))
compare_to_truth(poststratified_estimates, truth)
```
The results shown in Fig. 5.9 reflect all the gains from a larger sample size, ensemble modeling, and a full set of individual-level and group-level predictors.

#### **5.4 Conclusion**

For policy researchers interested in public opinion, MRP and its various refinements offer a useful approach to get the most out of survey data. The results I've presented in this chapter suggest a few lessons to keep in mind when applying MRP to one's own research.

First, be wary of first-stage models that are underfit or overfit to the survey data. As we saw in Fig. 5.3, MRP estimates with too few predictors tend to over-shrink toward the grand mean.<sup>5</sup> Using such estimates to inform subsequent causal inference would understate the differences between regions. Conversely, models that are overfit to survey data (e.g., Fig. 5.4) will tend to exaggerate regional differences.

Second, new techniques like synthetic poststratification and stacked regression can help researchers manage the trade-off between underfitting and overfitting. Synthetic poststratification allows for the inclusion of more relevant predictors, and regularized ensemble models help ensure that the predictions are not overfit to noisy survey samples. The best estimates often come from combining these two approaches.

Finally, recall that the most significant performance gains in our demonstration came not from more sophisticated modeling techniques, but from more data. As we saw in Fig. 5.6, working with a larger survey yielded greater improvements than any tinkering around with the first-stage modeling choices. MRP is not a panacea, and one should be skeptical of estimates produced from small-sample surveys, regardless of how they are operationalized.

In the code above, I emphasize "do-it-yourself" approaches to MRP – fitting a model, building a poststratification frame, and producing estimates separately. But there are now a number of R packages available with useful functions to help ease the process. In particular, I would encourage curious readers to explore the *autoMrP* package (Broniecki et al., 2022), which implements the ensemble modeling approach described above and performs quite well in simulations when compared to existing packages.

#### **Further Suggested Readings**


<sup>5</sup> In the limit, a first-stage model with zero predictors would yield identical poststratified estimates for each state, equal to the survey sample mean.

#### **Review Questions**


#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 6 Pathway Analysis, Causal Mediation, and the Identification of Causal Mechanisms**

#### **Leonce Röth**

**Abstract** This chapter presents the systematic analysis of causal mechanisms from the perspective of pathway analysis as an essential complement to conventional approaches to causation. It builds on the evidence that credible causal identification defies design-based strategies such as randomization or linear mediation analysis unless their research designs are supported by reliable mechanistic knowledge. The chapter reasons that the reliable causal identification of a mechanism requires the concept of 'natural indirect effect' and a double-nested counterfactual strategy. It discusses the empirical quantification of causal mechanisms and its underlying assumptions, offers empirical examples that clarify them, and reviews the conditions and limits of the strategy.

#### **Learning Objectives**

After studying this chapter, you will be able to:


L. Röth (\*)

University of Cologne, Cologne, Germany e-mail: Leonce.Roeth@uni-koeln.de

#### **6.1 Introduction**

An increasingly popular postulate of causal analysis maintains that good research includes some account of *how* one variable generates another to underpin a causal claim. Causal mechanisms are at the center of research in small-n analyses, often are a crucial part of the theoretical argument in large-n studies, and prove indispensable for scholars of systematic pathway analysis. In some accounts, a credible causal mechanism makes the difference between explanatory and non-explanatory propositions (Waldner, 2007, 146; Kiser & Hechter, 1991, 5; Mayntz, 2004, 14; Hedström, 2008).

Asking not just for a cause of an effect but also for the intermediate process in between is a deeper or second form of asking *why* (Pearl & Mackenzie, 2018, 299–300). The response to this deeper *why* always complements other types of evidence but remains crucial for qualifying the external and internal validity of causal relations. Indeed, mechanisms can raise our confidence in the established validity of a causal association – or undermine it (internal validity). Moreover, their knowledge can change the inference on evidence even from well-executed trials and improve the next experimental setup. This is because mechanisms convey information on the scope conditions of a causal association, which expose the limits of causal effects and their underlying processes (external validity). Besides, knowledge of mechanisms can reveal multiple pathways between cause and outcome, thus guiding us to more effective interventions.

A textbook illustration of these points comes from one of the earliest documented controlled experiments. In 1747, James Lind observed that eating citrus fruits prevents scurvy; understanding and validating the mechanism between citrus intake and scurvy prevention took another 183 years. In the meantime, the link from citrus to scurvy was discredited because the mechanism and its scope conditions remained unknown.<sup>1</sup>

The central intuition about the citrus treatment was that it involved vitamin C – a particular type of acid, later called 'ascorbic' in recognition of its scurvy preventive properties. We now know that vitamin C oxidizes when exposed to heat and light or put in contact with copper. In other words, the citrus treatment only works under specific scope conditions. Back then, however, the juice was heated for conservation, copper pipes were in widespread use, and exposure to light was regular. Thus, many attempts to produce lime juice for sea travels proved ineffective against scurvy.

Furthermore, mechanisms take time to unfold. Today we know that the intake of ascorbic acid activates the synthesis of collagen IV. Collagen is a structural protein necessary for healthy blood vessels, muscle, skin, bone, cartilage, and other connective tissues. Ascorbic acid is required for various biosynthetic pathways; when these pathways decay, humans develop a series of symptoms

<sup>1</sup>The startling history of the cure for scurvy is well told in Lewis (1972). Pearl and Mackenzie (2018) recall it to illustrate mediation. This chapter's version enriches the history with some recent knowledge about the causal mechanism, and gives center stage to its scope conditions.

collectively assembled in the diagnosis of scurvy. Moreover, humans cannot synthesize collagen without ascorbic acid and have a low capacity to store it. As collagen IV synthesis stops 4–12 weeks after the last intake of ascorbic acid, symptoms of scurvy start to be visible after 4 weeks. The citrus intake also appeared ineffective for sea travels as the diffusion of steam navigation made many sea trips too short for the symptoms to show. However, Arctic expeditions remained long enough, and many seafarers suffered from scurvy in expeditions until the early twentieth century.<sup>2</sup>

For a long time, the wrong inference that citrus intake is ineffective for scurvy prevention survived due to the lack of knowledge of the mechanism of activation of collagen IV synthesis. Filling this gap proved crucial for restoring the causal association, as the mechanism disclosed many necessary scope conditions required for it to hold – namely, time, temperature, and exposure to light or copper. These conditions imply that the link between the effect of the treatment and the outcome can only be established in a study period of at least 4 weeks and if the ascorbic acid is kept intact. Moreover, they suggest that the link blurs whenever equivalent pathways are activated – for instance, if seafarers can eat raw meat or any fresh food containing sufficient ascorbic acid. Thus, perfect randomization of citrus intake may not reveal its preventive effect when its design does not take the relevant scope conditions of the mechanism into account.

In short, the knowledge of mechanisms improves three vital criteria of scientific inference – reliability and internal and external validity. But how to study mechanisms systematically?

In the following, I present the answer provided by the particular version of pathway analysis that merges graph theory with a counterfactual model of causality into a powerful framework for identifying mechanisms. This development is roughly 15 years old and still in full swing. It has taken computer science and biology by storm: biostatisticians now usually run millions of pathway models a minute to analyze gene expressions and understand the mechanisms linking a drug treatment and its effect. In comparison, social scientists still seem hesitant to embrace the many benefits that such a pathway perspective can bring. This chapter's first and foremost intention is to reduce hesitation.<sup>3</sup>

To this end, Sect. 6.2 locates the mechanistic why-question in the philosophy of science and discusses the assumptions under which a generic definition of a pathway or mediator<sup>4</sup> can be called 'a mechanism'. Then, Sect. 6.3 discusses how to distinguish between mechanistic associations and causal mechanisms. To this end, it dwells upon a remarkable strength of this method for pathway analysis – a

<sup>2</sup>Notably, the two expeditions of Robert Falcon Scott to Antarctica in 1903 and 1911 suffered greatly from scurvy.

<sup>3</sup>Excellent discussions of causal identification of mechanisms using graph theory are in Morgan and Winship (2015, Chap. 10); Pearl and Mackenzie (2018, Chap. 9); VanderWeele (2015, Part One). This chapter owes almost everything to these contributions. However, it takes a more specific angle on the causal identification of mechanisms in the social sciences.

<sup>4</sup>Note that, in some disciplines, the identification of mechanism is synonymous with causal mediation analysis. Here, instead, mediation is considered a special instance of pathway analysis.

graphical rendering of causal assumptions that helps to lay out the structural conditions under which pathways are causally identified or mistaken. Thus, it clarifies how the graph perspective improves on one of the most applied and cited methods in the history of the social sciences – the so-called Baron-Kenny approach to mediation analysis – and, in so doing, enhances our conditioning strategies.

Section 6.4 discusses the innovative core of pathways analysis – namely, the 'decomposition' and the quantification of the total, direct, and indirect effects on observational data. Indeed, Judea Pearl and others spearheaded a causal revolution when they defined the conditions of causally identified pathways and developed non-parametric formulae to decompose total effects into direct and indirect ones (Pearl, 2022). This quantification strategy of pathway effects took time to be accepted and faced some deep-rooted skepticism from the more conventional quarters of causal analysis (e.g., Rubin, 2004; Rubin, 2005). Nevertheless, social science scholars are slowly getting familiar with indirect effects and their underlying counterfactual theory of causation (see Imbens, 2020).

Section 6.5 replicates one influential model from development economics and sketches another from educational research. The first example demonstrates how strong supposedly mechanistic inference based on innovative cluster randomization in Kenya can be misleading. The second example shows how pathways analysis can draw important mechanistic lessons from a randomized controlled trial run in the United States to seemingly no effect. These examples prove mechanistic knowledge essential to validate and refine even causal evidence from compelling research designs.

The last section of this chapter intends to keep the promises of the pathway approach in check and dispel the illusion that causal identification is a simple technical exercise. As randomized controlled trials or instrumental variable applications show, the devil lies in the detail of the exclusion restrictions; in this respect, pathway causal identification is even more demanding than total effects via randomization or quasi-randomization. Pathway analysis reminds us that our models seldom ensure the perfect causal identification of a mechanism. Indeed, the complexity of the real world typically defies our attempts to draw exhaustive causal maps with analytic tools that require exclusion restrictions. Nonetheless, these restrictions ensure the transparent rigor that qualifies evidence as causal and distinct from mere association.

#### **6.2 Can Pathways Be Mechanisms?**

Sometimes, the concepts of mechanism, pathway, and mediation can be confusing. All three terms adhere to the general idea of increasing causal depth by diminishing the contiguity of time and space between cause and outcome. However, what exactly is considered a cause–effect framework and a mechanistic framework is subject to the relative status of a research field and is constantly in flux (see also Chap. 2, Sect. 2.3.1).

What appears to be a sufficiently deep causal mechanism in one particular research tradition and time can be perceived as a superficial association in another. Ideally, research fields increase causal depth over time and remain cautious about the trade-off between desirable specificity and useful parsimony (Craver & Kaplan, 2020). The balance of specificity and parsimony changes while research progresses, and what was considered a mechanism once might be addressed as separate cause–effect relations. Recall from the introduction that it took 183 years to detect the crucial acid for the mechanism between citrus intake and scurvy prevention. During the attempts to isolate ascorbic acid, the intake of vitamin C could have been appropriately described as the causal mechanism. In light of new knowledge, researchers today focus on far more specific biosynthesis pathways as distinct causal relationships. In short, researchers have taken the old mechanism to greater causal depth. Philosophers of science call this kind of deepening process "bottoming-out" (see Fig. 6.1) or, in simpler terms, delivering on the demand for the explanation that can stop the infinite regress in causal analysis.

Aiming at fundamental explanations has had a strong appeal for a long time now in the social sciences (see Elster, 1989; Goldthorpe, 2001; Hedström et al., 1998; Hedström & Ylikoski, 2010; Knight & Winship, 2013). Nonetheless, causal mechanisms are also seen as the least understood kind of causal claim (Gerring, 2010; Hedström & Ylikoski, 2010; Waldner, 2012).

Some scholars use the term "mechanism" to refer to a series of events between the original cause and the outcome (Abell, 2004; Mahoney, 2012; Morgan & Winship, 2015; Pearl, 2009; Pearl & Mackenzie, 2018). The concept of "pathway", too, indicates a chain of mediators connecting a cause to an outcome. Thus, some have embraced the term "mechanism" for the analysis of pathways across cases (see Gerring, 2010; Imai et al., 2011; Weller & Barnes, 2014; Woodward, 2003, 350–58; Runhardt, 2015; Morgan & Winship, 2015, 325–352). Other scholars, however, try to use the term "causal mechanism" exclusively for process tracing within single cases (for example, Beach, 2017). These scholars adhere to the "process" or "physical" theories of causation that provide a substantive account of what causal processes are in light of what science tells us about the world (Dowe, 2000, 1–11 and Chap. 10).

Far from a terminological subtlety, these usages point to a fundamental divide over the concept of mechanism. The first group considers causality a matter of epistemology that can be addressed with probabilistic or counterfactual models. From this standpoint, establishing causation is an exercise in logic that many techniques can perform – provided that they afford comparisons ("type" causality; see Rohlfing & Zuber, 2021, 1634–35). In contrast, the holders of the process theory of causation maintain that causality is necessarily local – which means that it is manifest only in individual cases ("token" causality). Following the process view, within every unique case, causality exists in fine-grained sequences of entities' activities that have to satisfy the criterion of seamless productive continuity (Dowe, 2000). From the perspective of bottoming-out, the process viewpoint on mechanistic causation raises the highest possible demand on causal depth.

**Fig. 6.1** Approaching to causal depth

A pathway as a sequence of mediators (or interactions) cannot satisfy the ontological criteria established by the process view of mechanistic causation. First, seamless productive continuity can hardly be demonstrated by pathway analysis. Second, the very strength of pathway analysis lies in inferences from comparisons across cases or samples. In short, from the process view on causation, pathways do not deserve the term "mechanism". However, this reservation is a relative rarity in the social sciences. Most scholars are satisfied with an evidential view on mechanisms as a cause-to-effect pathway that includes at least one mediator. Even without satisfying the high demands from the process view, pathway analysts also approach causal depth as they want to know what connects a supposed cause and its outcome at the fundamental level, hence in a general form. As we will see in the next part, the biggest strength of pathway analysis in that ambition for deeper explanations is epistemological. Pathway analysis has developed clear and transparent criteria to distinguish causal mechanisms from mechanistic associations.

#### **6.3 Identifying Causal Mechanisms with Graphs**

Causal identification is a general problem independent of the commitment to a mechanistic theory (Pearl, 2009). Pearl's metaphor of a "ladder of causation" renders the solutions to the identification problem as a historical endeavor toward more reliable causal knowledge (Pearl & Mackenzie, 2018, 23–52). In this line of thought, scientists moved from the regularity theory via the probabilistic theory to the interventionist theory before reaching the top level of the counterfactual theory. As Pearl's argument goes, counterfactuals occupy the top rung as they synthesize and improve on previous solutions to causal identification problems.

From a regularity viewpoint, only the perfect sequence of the candidate cause and outcome constitutes evidence for causation. In our scurvy example, the regularity criterion requires that every citrus intake prevents scurvy without exceptions. The scope conditions of the mechanism show this bare inference to be mostly wrong. Under some circumstances, citrus can fail, or the causal effect might be observed without citrus. In Pearl's account, the limits of perfect regularity motivate the shift toward the probabilistic account of causality.

The probabilistic account admits that a causal relation unfolds or fails due to scope conditions and alternative mechanisms but maintains that many of them remain unknown. Hence, our best knowledge about citrus intake can focus on whether it affects the probability of getting scurvy net of contextual vagaries – that is, on average. However, evidence that a factor affects the probability of an outcome does not constitute evidence for causation either. A limit of the probabilistic approach is that it cannot establish the direction of causation – a problem known as "asymmetry" or "endogeneity". In light of observed probability, for instance, it might also be that scurvy causes lemon intake.

The problem of asymmetry is solved when the candidate cause precedes the outcome. The best way of ensuring this order is to get some control over the candidate causal factor. So, if we prescribe citrus intake to healthy and compliant seafarers once on board, we can gather more convincing evidence of its contribution to the probability of getting scurvy. This approach is at the heart of the 'interventionist' school of causality.

With the asymmetry problem being solved, the thorniest issue of causal identification takes center stage. Even in an interventionist framework, confounders can bias the identification. Thus, we might mistake the sequence of two events as causal despite it being due to a third unobserved factor instead. Logically, the counterfactual theory of causation can discriminate between a confounded relationship and a causal one. The observed event is the real cause when it precedes the outcome, *and* its manipulation resonates with a change in the outcome that would not have occurred without the intervention. Thus, the counterfactual subsumes all preceding approaches to causal identification. Moreover, it embraces the 'would haves' and, on this basis, can offer a single theoretical solution to both asymmetry and confounding problems.

The counterfactual approach is deeply embedded in pathway analysis with graphs. Its notation responds to the problem of asymmetry by using directed arrows to clarify the direction of causality in contrast to the equal sign typical of the regression framework. Directed arrows connect "nodes" or variables in structures of dependency that recall family trees. Thus, the nodes in a path of directed arrows can be indicated as "grand-parent", "parent", "child", and "grand-child." These structures embody strong and weak causal assumptions. An arrow between two nodes indicates a weak causal assumption. It renders the direction of dependency – the fact that values of the child variable change in response to the values taken by the parent variable – but neither its sign<sup>5</sup> nor the size of the causal effect. The strongest causal assumption is the absence of an arrow between two nodes, as it signals that the corresponding variables take their values independently of one another. Furthermore, pathway analysts have introduced the so-called "*do*-operator" to mimic an intervention on an arrow and model the effect of its removal on observational data. This operator marks a relevant difference from conventional counterfactual studies based on non-intervention.

<sup>5</sup>However, some biologists introduced a distinction in the notation of the positive and the negative effects.

#### *6.3.1 Closing the Backdoor*

Graph theory offers a transparent strategy to tackle the two crucial problems of causal identification, namely, asymmetry and confounding. Figure 6.2 illustrates the task in its simplest form.

On the left-hand side of Fig. 6.2, we see the identification for the total effect framework, as in a typical correlation or regression analysis. To declare the association between X and Y causal, we first need to demonstrate that X precedes Y and not the other way around. This assumption is embodied in the direction of the arrows. The second task is to check that the association between X and Y is not confounded by third factors such as C. Path X ← C → Y is a so-called "open back-door path" and can be seen as a pipe through which non-causal variance flows and confounds the true relationship between X and Y. Back-door paths can be closed in two ways. The first is conditioning on C. If we can hold C constant, the back-door paths between X and Y are closed, and the association between X and Y is not confounded anymore. To hold confounders constant is a common identification strategy – for example, in multivariate regressions where we regress Y on X and condition on C (Pearl & Mackenzie, 2018, 157). A second widespread approach is the randomization of X. If we assign the treatment condition of X randomly, all associations running into X are broken, and, therefore, all back-door paths are closed (compare middle part of Fig. 6.2). Experimental designs build on the randomization of the treatment. In quasi-experimental designs – such as regression discontinuity or instrumental variables – randomness in the assignment to treatment arises indirectly from natural factors or events independently of the causal channel of interest (see Chap. 3). If we can rule out both reversed causality and confounding, the associations between X and Y imply causation by necessity. The power of the back-door criterion is that it reveals under which conditions associations are causal even based on observational data.
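The two ways of closing a back door can be illustrated with a small simulation. The sketch below is not taken from the chapter: it assumes a hypothetical linear system C → X, C → Y, X → Y with a true effect of X on Y equal to 1.0, and the coefficients and the helper `ols` are illustrative choices only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def ols(outcome, *regressors):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

# Hypothetical system: C -> X, C -> Y, X -> Y (true effect of X on Y = 1.0)
c = rng.normal(size=n)
x = 0.8 * c + rng.normal(size=n)
y = 1.0 * x + 1.5 * c + rng.normal(size=n)

naive = ols(y, x)[1]        # back-door path X <- C -> Y left open
adjusted = ols(y, x, c)[1]  # back door closed by conditioning on C

# Randomizing X removes the arrow C -> X, which also closes the back door
x_rand = rng.normal(size=n)
y_rand = 1.0 * x_rand + 1.5 * c + rng.normal(size=n)
randomized = ols(y_rand, x_rand)[1]

print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f}, randomized: {randomized:.2f}")
```

Under these assumptions, the naive coefficient overstates the effect, whereas both the adjusted and the randomized estimates land close to the true value of 1.0.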

In a mechanistic framework, the two conditions for a causal interpretation of associations are the same: X needs to precede Y, and all back-door paths between X and Y need to be closed, as on the right-hand side of Fig. 6.2. However, these conditions allow the causal interpretation of the total effect between X and Y, not the causal interpretation of the other quantities of interest to a mechanistic framework – namely, the effect of X on M (X → M, M being the mediator), and the effect of M on Y (M → Y; Y being the outcome). More conditions must be fulfilled to allow for a causal interpretation of the associations b and c on the right-hand side of Fig. 6.2.

**Fig. 6.2** Causal identification with and without a mechanism

X has to precede M, and M has to precede Y. Furthermore, all three associations (a, b, and c) have to be un-confounded to reveal the 'true' causal effect from X → M, from M → Y, and the remaining effect of X → Y. In that framework, the total effect equals the sum of the effect from X over M to Y (the *indirect* effect) and the remaining effect of X on Y (the *direct* effect).

If we randomize the treatment X of a mediation model, the randomized treatment blocks all arrows running into X. In the example on the right-hand side of Fig. 6.2, the randomization means ruling out the confounding of C1 and C2 so that the total effect of X on Y still is the true causal effect. However, even with a randomized treatment, we are still unable to quantify the indirect effect. The reason is that C3 is left unconditioned and confounds the relationship between M and Y (path c). Randomization of the treatment does close all back-door paths running into X but does not suffice to identify mechanisms. Unfortunately, the problem of potential confounding between M and Y runs even deeper.
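A simulation can make this limit of randomization tangible. The following sketch assumes a hypothetical linear mediation model in which X is randomized, X → M equals 0.5, M → Y equals 0.4, the direct effect equals 0.3, and an unobserved C3 drives both M and Y; none of these numbers come from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

def ols(outcome, *regressors):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

x = rng.integers(0, 2, size=n).astype(float)  # randomized binary treatment
c3 = rng.normal(size=n)                       # confounder of M and Y
m = 0.5 * x + 1.0 * c3 + rng.normal(size=n)
y = 0.3 * x + 0.4 * m + 1.0 * c3 + rng.normal(size=n)

total = ols(y, x)[1]                          # identified by randomization
_, direct_naive, m_naive = ols(y, x, m)       # C3 treated as unobserved
_, direct_adj, m_adj, _ = ols(y, x, m, c3)    # C3 observed and conditioned on

print(f"total effect:             {total:.2f}")
print(f"M -> Y without C3:        {m_naive:.2f}")
print(f"M -> Y conditioning on C3: {m_adj:.2f}")
```

In this setup, the total effect is recovered (about 0.5), but the M → Y coefficient is roughly twice its true size unless C3 is conditioned on, and the direct effect is distorted accordingly.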

Figure 6.3 represents a famous causal model of the effect of smoking on child mortality. It corresponds precisely to the constellation described on the right-hand side of Fig. 6.2 and illustrates a fundamental problem of mechanistic identification, the collider bias. The collider bias has troubled statisticians for centuries and led to uncountable false inferences, the birth-weight paradox being just one prominent example.<sup>6</sup>

Let us consider the example in Fig. 6.3. In the mid-1960s, Jacob Yerushalmy pointed out that smoking during pregnancy seemed to benefit the health of children if the baby happened to be born underweight – the so-called "birth-weight paradox" (see Yerushalmy, 1971).<sup>7</sup> Until 2006, this paradox remained unexplained.

In an extensive data set, Yerushalmy found unexpected relationships. Babies of smokers were lighter than babies of non-smokers. However, within the group of low-birth-weight babies, the babies of smoking mothers had a better survival rate than those of non-smokers. It was as if the mother's smoking had a protective effect within the group of babies being born underweight. The inference was that "there is no causal path from smoking to mortality" (Yerushalmy, 1971). How come?

Yerushalmy's findings are the consequence of a problematic conditioning strategy. He was unaware of the importance of genetic disposition and operated under the assumption of the left model in Fig. 6.4. However, even within that model, it does not make sense to condition on birthweight. Birthweight is not a confounder, but a mediator. Conditioning on the mediator means correcting for the variance that runs through it. In the example, it means controlling for the *indirect* effect of birthweight. The remaining effect of X on Y is typically seen as the *direct* effect.

**Fig. 6.4** Collider bias in mediation analysis

<sup>6</sup> It likely was Barbara Burks who first modeled the problem using causal graphs in 1926.

<sup>7</sup>An excellent discussion of the birthweight paradox can be found in Wilcox (2006).

Conditioning on a mediator is justified to separate the indirect effect (X → M → Y) from the direct one (X → Y). As such, it lies at the heart of the conventional mediation analysis. Indeed, conventional mediation analysis compares effect estimates of the cause based on two separate regressions. The crucial difference runs between the estimate of the coefficient of X on Y in a model without a mediator and in one conditioned on the mediator. As an illustration, if 100% of the variance of the effect from cause X runs through mediator M, conditioning on M leads to a null coefficient of the cause. Baron and Kenny (1986) define three necessary, but not sufficient, conditions for detecting mediation along these lines:<sup>8</sup>


This reasoning allows inferring four types of mediation based on how the effect of X on Y changes when we condition on M (see Fig. 6.5).

Conventional mediation analysis speaks of 'full mediation' when the total variance is associated with the path from X via M to Y (indirect effect), and the direct effect of X on Y leaves nothing unexplained. "Partial mediation" is inferred from a reduced direct effect of X on Y after conditioning on the mediator. "No evidence for mediation" is inferred when the conditioning on the mediator does not affect the direct effect from X on Y. Finally, "inconsistent mediation" is inferred when the adjustment on the mediator reverses the direction of the effect of X on Y.

The birth weight paradox is an instructive example of inconsistent mediation. The reason is that the most prominent factor for low birth weight is a specific genetic disposition that exerts an even greater impact on mortality than smoking. Genetic dispositions confound the path M → Y, as illustrated on the right-hand side of Fig. 6.4. It is easy to see that Yerushalmy overlooked an important confounder; what is not so easy to see is that Yerushalmy conditioned on a *collider*.

**Fig. 6.5** Types of mediation. (**Note:** \*\*\* refers to the level of significance)

<sup>8</sup>Note that this paper is one of the most cited papers in scientific history.

A collider is given when the same outcome depends on two different causes or, in graphical terms, when at least two arrows point to the same node. In Fig. 6.4, birthweight is a mediator (X → M → Y) and a collider (X → M ← C). Adjusting for the collider means opening a closed back-door path from X over C to Y. In other words, conditioning on birthweight creates a spurious positive association between the smoking of mothers and children's survival because genetic dispositions confound the relationship between birth weight and child mortality.

In short, Yerushalmy's surprising findings follow from this troublesome conditioning strategy. Conditioning on birth weight leads to an entirely new comparison within the stratum of children with low weight at birth. Within this new stratum, smoking mothers seem to affect babies' survival positively. However, this association is spurious. Genetic disposition has an even stronger effect on birth weight than smoking, and unless controlled for, it biases the association between birth weight and child mortality.
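The paradox can be reproduced with a handful of made-up probabilities. The sketch below is purely illustrative and does not use Yerushalmy's data: it assumes a binary smoking indicator, an unobserved genetic disposition G, low birth weight as the mediator, and a mortality risk in which smoking is harmful for every baby. All numbers are hypothetical.

```python
from itertools import product

# Hypothetical probabilities (not estimates from any real data set)
P_G = {0: 0.95, 1: 0.05}                      # unobserved genetic disposition
P_LOW = {(0, 0): 0.04, (1, 0): 0.20,          # P(low weight | smoking, G)
         (0, 1): 0.90, (1, 1): 0.95}

def p_death(smoking, g, low):
    """Mortality risk: smoking raises it for every baby by construction."""
    return 0.01 + 0.02 * smoking + 0.30 * g + 0.02 * low

def mortality(smoking, low=None):
    """P(death | smoking [, low birth weight]), summing over the unobserved G."""
    num = den = 0.0
    for g, l in product((0, 1), repeat=2):
        if low is not None and l != low:
            continue
        p_low = P_LOW[(smoking, g)] if l == 1 else 1 - P_LOW[(smoking, g)]
        weight = P_G[g] * p_low
        num += weight * p_death(smoking, g, l)
        den += weight
    return num / den

print("overall:   ", round(mortality(1), 4), "vs", round(mortality(0), 4))
print("low weight:", round(mortality(1, low=1), 4), "vs", round(mortality(0, low=1), 4))
```

With these numbers, babies of smokers have the higher mortality overall, yet within the low-birth-weight stratum they appear to fare better (about 0.11 against 0.19). The reversal arises because low weight among non-smokers is far more likely to signal the harmful genetic disposition.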

The graph-theoretical solution of the birth weight paradox offers at least two important lessons. First, while conditioning on confounders closes back-door paths and yields unbiased associations, conditioning on mediators and/or collider variables leads to biased associations. Second, and more important for the causal identification of mechanisms, standard mediation analysis proves unreliable. Conditioning on a collider has caused uncountable "mediation fallacies" (Pearl & Mackenzie, 2018, 315). Despite the increased awareness, the pervasiveness of the problem can still be underestimated. Indeed, mediation fallacies are not limited to the cases of inconsistent mediation. Instead, they may affect all types of conventional mediation with significant consequences. If a collider cannot be ruled out, regression-based mediation analysis cannot be trusted to produce reliable effect estimates as we cannot quantify the bias introduced by conditioning on the mediator.

Figure 6.6 illustrates a more complex causal system where we might be interested in the relative importance of pathway X → M1 → M2 → Y versus pathway X → M3 → Y. This identification task clearly falls beyond the possibilities of the regression framework and demands the more powerful approach to pathway analysis that graphs afford instead.

The overall model entails 11 variables and consists of 16 paths. The back-door criteria guide us to an effective conditioning strategy. There is no confounding between X and Y and the total effect represents the true causal effect, as we declare the causal system exhaustive. However, estimating the indirect effect of the two pathways of interest requires conditioning. The effect of path b is biased unless we condition on C1. The effect of path d is biased unless we condition on C2, C3, or C2 and C3 – conditioning on any of these confounders blocks the back-door path M2 ← C2 → C3 → Y effectively. A1 could be considered an alternative explanation for Y on which it is unnecessary to condition because it does not affect the quantities of interest. C4 and C5 should not be conditioned on: C4 is a collider, and conditioning on it would open the non-active backdoor path M3 → C4 → C5 → Y; similarly, C5 should not be conditioned on because of the extended collider rule, according to which conditioning on the 'descendants' of a collider also activates back-door paths.

**Fig. 6.6** More complex pathways

The overall goal of the conditioning strategy guided by the back-door criterion is to block all the paths that generate non-causal associations between the cause and the outcome without inadvertently blocking any of the paths that generate the causal effect itself (Morgan & Winship, 2015, 109). Conditioning on C in Fig. 6.2 is a viable option whereas conditioning on M in Fig. 6.3 opens an otherwise closed back-door path. Eventually, with Morgan and Winship (2015, 109), the back-door criterion can be defined as follows:

If one or more back-door paths connect the causal variable to the outcome variable, the causal effect is identified by conditioning on a set of variables Z if

*Condition 1***:** All back-door paths between the causal variable and the outcome variable are blocked after conditioning on Z, which will always be the case if each back-door path


and

*Condition 2***:** No variables in Z are descendants of the causal variable that lie on any of the directed paths that begin at the causal variable and reach the outcome variable.
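The blocking logic behind these conditions can be sketched in a few lines of code. The example below is an illustration rather than a full d-separation routine: it checks a single path against a conditioning set, using the standard rules that a non-collider blocks a path when it is conditioned on and a collider blocks a path unless it (or one of its descendants) is conditioned on. The graph encodes the right-hand model of Fig. 6.4 as described in the text, and the function names are hypothetical.

```python
def descendants(node, edges):
    """All nodes reachable from `node` along directed edges (parent, child)."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def path_is_blocked(path, edges, z):
    """Check whether conditioning on the set z blocks the given path."""
    for prev, node, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, node) in edges and (nxt, node) in edges
        if is_collider:
            # A collider blocks unless it or one of its descendants is in z
            if node not in z and not (descendants(node, edges) & z):
                return True
        elif node in z:
            # A chain or fork node blocks once it is conditioned on
            return True
    return False

# Right-hand model of Fig. 6.4: smoking X, birth weight M, mortality Y, genes C
edges = {("X", "M"), ("M", "Y"), ("X", "Y"), ("C", "M"), ("C", "Y")}
path = ["X", "M", "C", "Y"]  # X -> M <- C -> Y

for z in (set(), {"M"}, {"M", "C"}):
    state = "blocked" if path_is_blocked(path, edges, z) else "open"
    print(sorted(z), state)
```

The path is blocked when nothing is conditioned on, opens once the collider M enters the conditioning set, and closes again when C is added, which mirrors the reasoning applied to the birth-weight paradox.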

However, closing the back-doors is only one of two possible identification strategies.

#### *6.3.2 Closing the Front Door*

The front-door criterion provides another interesting identification strategy derived from causal graph theory in cases where essential confounders remain unobserved. For example, let us turn to the prize-winning paper on skills and the labor market by Glynn and Kashin (2018). Glynn and Kashin applied the front-door criterion to a well-known dataset on the effect of the Job Training Partnership Act (JTPA). The Act institutes a job training program to equip participants with different skills. The dataset contains data on the people who applied for the program, whether they showed up, and their earnings over 18 months. The study includes a randomized control trial (RCT) and an observational component. Figure 6.7 provides the causal graphs of the general problem (left), the example (middle), and the front-door approach (right).

The variable *signed up* records whether a person enrolled in the job training, the variable *showed up* whether the enrollee used the services. The program can only affect the earnings if users showed up, so the absence of a direct arrow from *signed up* to *earnings* can be easily justified. In other words, the entire effect is mediated. Let us say cause, outcome, and mediator are all affected by the general motivation of an applicant, but unfortunately, we have not measured motivation. In a causal graph, an unmeasured variable is typically depicted by a hollow node.

The logic of the front door is to block all paths running into M – in other words, to shield the mediator. In the example of Fig. 6.7, we might randomly call applicants off and compare the randomly canceled applicants with those given real training. With all front-door paths being closed, the estimates of paths b and c can be calculated and are unbiased by definition. In that example, absent a direct effect, the indirect effect equals the total effect, and the estimate using the front-door equals the estimate based on the randomization of X. Glynn and Kashin compared the front-door predictions with those from a randomized controlled experiment, and found the results very similar (Glynn & Kashin, 2018).

The front-door approach could remove almost all of the bias introduced by the omission of the confounder of motivation. In contrast, a simultaneous estimation using the back-door without the possibility of conditioning on motivation showed substantial differences to both the experimental results and the front-door approach (Glynn & Kashin, 2017, 2018).

**Fig. 6.7** How to shield a mediator

With Morgan and Winship (2015, 333–334), the front-door criterion can be defined as follows:

If one or more unblocked back-door paths connect a causal variable to an outcome variable, the causal effect is identified by conditioning on a set of observed variables, M, that make up an identifying mechanism if

*Condition 1* (*exhaustiveness*): The variables in the set M intercept all directed paths from the causal variable to the outcome variable.

and

*Condition 2* (*isolation*): No unblocked back-door paths connect the causal variable to the variables in the set M, and all back-door paths from the variables in the set M to the outcome variable can be blocked by conditioning on the causal variable.
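A worked computation may help to see what the two conditions buy. The sketch below assumes the textbook front-door setting, which is stricter than the job-training narrative: a hypothetical unobserved confounder U affects only X and Y, the mediator M depends on X alone, and X has no direct effect on Y. Under these assumptions, Pearl's front-door adjustment formula, P(y | do(x)) = Σ_m P(m | x) Σ_x' P(y | m, x') P(x'), uses only the observed joint distribution of X, M, and Y. All probabilities are invented for illustration.

```python
from itertools import product

# Hypothetical data-generating process; U is treated as unobserved
P_U1 = 0.5
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}
P_M1_GIVEN_X = {0: 0.1, 1: 0.9}

def p_y1(m, u):
    return 0.1 + 0.5 * m + 0.3 * u

def bern(p, value):
    """Probability that a Bernoulli(p) variable equals `value`."""
    return p if value == 1 else 1 - p

# Observed joint distribution over (x, m, y), with U summed out
joint = {}
for u, x, m, y in product((0, 1), repeat=4):
    p = (bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x)
         * bern(P_M1_GIVEN_X[x], m) * bern(p_y1(m, u), y))
    joint[(x, m, y)] = joint.get((x, m, y), 0.0) + p

def prob(**fixed):
    """Probability of an event defined on the observed variables x, m, y."""
    total = 0.0
    for (x, m, y), p in joint.items():
        values = {"x": x, "m": m, "y": y}
        if all(values[k] == v for k, v in fixed.items()):
            total += p
    return total

def front_door(x):
    """P(Y = 1 | do(X = x)) via the front-door adjustment formula."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = sum(prob(x=xp, m=m, y=1) / prob(x=xp, m=m) * prob(x=xp)
                    for xp in (0, 1))
        total += p_m_given_x * inner
    return total

naive = prob(x=1, y=1) / prob(x=1) - prob(x=0, y=1) / prob(x=0)
front_door_effect = front_door(1) - front_door(0)
true_effect = 0.5 * (P_M1_GIVEN_X[1] - P_M1_GIVEN_X[0])

print(f"naive: {naive:.2f}, front-door: {front_door_effect:.2f}, true: {true_effect:.2f}")
```

In this constructed example, the naive contrast (0.58) is inflated by the unobserved confounder, while the front-door estimate matches the true effect (0.40) without ever using U.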

At this point, we have learned two different ways to identify causal mechanisms. By definition, closing all back-door paths or closing all front-door paths leads to causal estimates even with observational data. The logic of back-door paths explains why the identification of the indirect effect is neither ensured by the randomization of the cause nor by conditioning on the mediator as applied by conventional regression-based mediation analysis. The next section discusses how indirect and direct effects can nonetheless be identified.

#### **6.4 Identifying Indirect Effects**

For a long time, mediation analysts defined:

$$\text{Total Effect} = \text{Direct Effect} + \text{Indirect Effect}$$

This formula understands the indirect effect as a residual category. The Baron-Kenny approach (1986) is entirely built upon this logical pillar. As a straightforward consequence, the conventional approach advised conditioning on the mediator to arrive at the direct effect and, by virtue of the composition assumption, calculating the indirect effect of mediation as the total minus the direct effect.
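In a strictly linear system estimated by ordinary least squares, this residual logic and the product-of-coefficients logic coincide. The sketch below illustrates the point on simulated data; the coefficients (0.6 for X → M, 0.5 for M → Y, 0.2 for the direct effect) and the helper `ols` are assumptions made for the example, and the model deliberately contains no confounding and no interaction.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

def ols(outcome, *regressors):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(outcome)), *regressors])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef

# Hypothetical linear mediation model without confounding or interaction
x = rng.normal(size=n)
m = 0.6 * x + rng.normal(size=n)
y = 0.2 * x + 0.5 * m + rng.normal(size=n)

total = ols(y, x)[1]              # Y regressed on X only
_, direct, m_to_y = ols(y, x, m)  # Y regressed on X and M
x_to_m = ols(m, x)[1]             # M regressed on X

indirect_by_difference = total - direct
indirect_by_product = x_to_m * m_to_y

print(f"difference: {indirect_by_difference:.4f}, product: {indirect_by_product:.4f}")
```

The two numbers agree up to floating-point error because the identity is algebraic in least-squares fits of this form; it says nothing about whether either number is causal.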

The first problem, as already seen, is that the composition holds only if M and Y are not confounded or, in other words, if a collider bias can be ruled out. The second problem is that the estimate of the residual is only credible in strictly linear systems. Once we relax the linearity assumption, the composition rule fails (Pearl & Mackenzie, 2018, 322–336).<sup>9</sup>

#### *6.4.1 Indirect Effect in Non-linear Systems*

The language of indirect, direct, and total effects evolved in the 1970s, but only recently was the indirect effect defined in causal terms. This shift entailed embracing counterfactual thinking.

<sup>9</sup>The problem of conventional mediation analysis is very fundamental. Mediation analysis based on the difference method (Baron & Kenny, 1986; Judd & Kenny, 1981) and linear regression models suffers from problems in the presence of interactions, non-linearities, binary outcomes, unobserved confounders, and other modeling complications (see Shpitser, 2013).

Let us start with the direct effect using the *do*-calculus. In the simple graph of treatment (X), mediator (M), and outcome (Y), we get the direct effect of X on Y when we intervene on X without allowing M to change. We *do*(M = 0) and randomly assign units to *do*(X = 1) or *do*(X = 0). We call this the 'controlled direct effect' or CDE.

CDE(0) arises when we force the mediator to take on the value of zero and can be computed as

$$CDE(0) = \Pr\left(Y = 1 \mid do(X = 1), do(M = 0)\right) - \Pr\left(Y = 1 \mid do(X = 0), do(M = 0)\right)$$

Had we forced the mediator to be 1, we would have denoted the resulting controlled direct effect as CDE(1). In practice, however, this alternative strategy could prove unwise as it forces M on instances of X that are potentially implausible to observe. Moreover, inferring the direct effect from the difference between CDE(1) and CDE(0) is to infer from an over-controlled experiment.

The so-called 'natural direct effect' or NDE offers an alternative perspective. We randomize X, but let M take the value it would naturally take. The 'would' indicates that a counterfactual is required. The NDE can be calculated as follows:

$$NDE = \Pr\left(Y_{M = M_0} = 1 \mid do(X = 1)\right) - \Pr\left(Y_{M = M_0} = 1 \mid do(X = 0)\right).$$

The NDE subtracts the probability of having a positive outcome without the treatment (X = 0) from the probability of having a positive outcome with the treatment (X = 1), in both cases with the mediator held at M0, the value it would naturally take without the treatment. In short, the NDE holds the mediator constant at its natural untreated value while the treatment is forced toward specific values. Indirect effects, unlike direct effects, have no controlled version because there is no way to disable the direct path by holding some variable constant.
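The difference between the controlled and the natural direct effect can be made concrete with a small numerical sketch. It assumes a binary treatment, mediator, and outcome, and it assumes that no confounding is present, so that the *do*-expressions reduce to ordinary conditional probabilities; under that assumption the NDE is commonly computed as Σ_m [Pr(Y = 1 | X = 1, M = m) − Pr(Y = 1 | X = 0, M = m)] · Pr(M = m | X = 0). The probabilities below are hypothetical.

```python
# Hypothetical conditional probabilities (binary X, M, Y; no confounding assumed)
P_M1_GIVEN_X = {0: 0.40, 1: 0.75}               # Pr(M = 1 | X = x)
P_Y1_GIVEN_XM = {(0, 0): 0.20, (0, 1): 0.30,    # Pr(Y = 1 | X = x, M = m)
                 (1, 0): 0.40, (1, 1): 0.80}

def p_m(m, x):
    """Pr(M = m | X = x)."""
    return P_M1_GIVEN_X[x] if m == 1 else 1 - P_M1_GIVEN_X[x]

def cde(m):
    """Controlled direct effect with the mediator forced to the value m."""
    return P_Y1_GIVEN_XM[(1, m)] - P_Y1_GIVEN_XM[(0, m)]

def nde():
    """Natural direct effect: mediator kept at its untreated distribution."""
    return sum(cde(m) * p_m(m, 0) for m in (0, 1))

print(f"CDE(0) = {cde(0):.2f}, CDE(1) = {cde(1):.2f}, NDE = {nde():.2f}")
```

Because the treatment and the mediator interact in these numbers, CDE(0) and CDE(1) differ (0.20 and 0.50), and the NDE (0.32) is their average weighted by how the mediator is distributed among the untreated.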

Indirect effects have a natural version, too, which again requires thinking in counterfactual terms. The natural indirect effect (NIE) arises when we abstain from the treatment but allow the mediator to be present. Understanding the causal properties of the indirect effect requires a double-nested counterfactual. In formal terms, we can define the natural indirect effect as follows:

$$NIE = \Pr\left(Y_{M = M_1} = 1 \mid do(X = 0)\right) - \Pr\left(Y_{M = M_0} = 1 \mid do(X = 0)\right).$$

The first term indicates the probability of a positive outcome when the treatment is absent but the mediator takes the value it would have under the treatment (M1). From this quantity, we subtract the probability of a positive outcome in the 'natural' untreated situation, where the treatment is absent and the mediator takes the value it would have without the treatment (M0).

The counterfactual M1 must be computed for each observation on a case-by-case basis. This requirement places the natural indirect effect out of the experimenters' reach as they may not know the value of the mediator M1 for any particular treatment X at the level of the individual unit. However, assuming there is no confounding between X and M as well as M and Y (i.e., ruling out the confounding and the collider bias), the NIE can still be computed on observational data. The natural indirect effect entails denying the treatment to everyone, and letting the mediator take the value it would have in the presence of the counterfactual treatment for each individual. The difference yields Pearl and Mackenzie's (2018, 333) mediation formula as follows:

$$NIE = \sum_{m} \left[ \Pr\left(M = m \mid X = 1\right) - \Pr\left(M = m \mid X = 0\right) \right] \cdot \Pr\left(Y = 1 \mid X = 0, M = m\right).$$

The expression stands for the effect of X on M in the subset of the units where the mediator takes the value *m* (in square brackets) times the probability that Y = 1 when X = 0 and the mediator takes the value *m*. So formulated, the NIE exposes the source of the product-of-coefficients idea and casts the product of two non-linear effects. Moreover, this formula allows calculating what is *explained by* mediation and the percentage *owed to mediation*.
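As a minimal sketch, the mediation formula can be evaluated directly from conditional probabilities. The numbers below are hypothetical, the absence of confounding is assumed, and the two ratios at the end follow one common reading of Pearl's terminology (NIE/TE as the share explained by mediation and (TE − NDE)/TE as the share owed to it); that reading is an interpretation, not a statement from the chapter.

```python
# Hypothetical conditional probabilities (binary X, M, Y; no confounding assumed)
P_M1_GIVEN_X = {0: 0.40, 1: 0.75}               # Pr(M = 1 | X = x)
P_Y1_GIVEN_XM = {(0, 0): 0.20, (0, 1): 0.30,    # Pr(Y = 1 | X = x, M = m)
                 (1, 0): 0.40, (1, 1): 0.80}

def p_m(m, x):
    """Pr(M = m | X = x)."""
    return P_M1_GIVEN_X[x] if m == 1 else 1 - P_M1_GIVEN_X[x]

def p_y1_given_x(x):
    """Pr(Y = 1 | X = x), averaging over the mediator."""
    return sum(P_Y1_GIVEN_XM[(x, m)] * p_m(m, x) for m in (0, 1))

# Mediation formula: effect of X on M, weighted by Pr(Y = 1 | X = 0, M = m)
nie = sum((p_m(m, 1) - p_m(m, 0)) * P_Y1_GIVEN_XM[(0, m)] for m in (0, 1))

# Natural direct effect and total effect for comparison
nde = sum((P_Y1_GIVEN_XM[(1, m)] - P_Y1_GIVEN_XM[(0, m)]) * p_m(m, 0)
          for m in (0, 1))
te = p_y1_given_x(1) - p_y1_given_x(0)

explained_by_mediation = nie / te
owed_to_mediation = (te - nde) / te

print(f"NIE = {nie:.3f}, NDE = {nde:.3f}, TE = {te:.3f}")
print(f"NDE + NIE = {nde + nie:.3f} (differs from TE when X and M interact)")
```

With these numbers, the NIE is 0.035, the NDE is 0.32, and the total effect is 0.46, so the additive composition rule of the linear case no longer holds.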

#### *6.4.2 Indirect Effect When the Cause and the Mediator Interact*

The identification of indirect effects becomes more complex when the mediator and the supposed cause (or "exposure") interact. A unified perspective on the decomposition of the total effect in a case where the independent variable of interest interacts with the mediator has been provided by VanderWeele (2014).

So far, effect decomposition has meant splitting a total effect into an indirect and a direct one. In the presence of exposure-mediator interaction, two components need to be added: one due to interaction only, the other due to mediation and interaction (see VanderWeele, 2014, 751). The counterfactual assumptions to identify the effect quantities are similar to those required to analyze causal mediation without interaction. As in the case of causal mediation, indirect effects including interactions require double-nested counterfactuals, whereas the direct effect requires weaker assumptions. The attribution of the interaction quantities to either the indirect or direct effect, instead, remains an empirical question. Figure 6.8 illustrates two possible response strategies based on VanderWeele (2014, 757).

The fourfold decomposition depicted in Fig. 6.8 encompasses both decompositions for mediation and interaction.

For interaction, the reference interaction (INT*ref*) and the mediated interaction (INT*med*) combine to give the portion attributable to interaction (PAI). The PAI combines with the controlled direct effect (CDE) and the pure indirect effect (PIE) to give the total effect (TE).

**Fig. 6.8** Fourfold decomposition

For mediation, the controlled direct effect and the reference interaction (INT*ref*) combine to give the pure direct effect (PDE); the pure indirect effect (PIE) combines with the mediated interaction (INT*med*) to give the total indirect effect (TIE), and the pure direct effect (PDE) combines with total indirect effect (TIE) to give the total effect (TE).
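The bookkeeping of the fourfold decomposition can be checked numerically. The sketch below assumes a linear outcome model with an exposure-mediator interaction, Y = θ0 + θ1·X + θ2·M + θ3·X·M, a linear mediator model E[M | X] = β0 + β1·X, a binary change in exposure from 0 to 1, and a reference mediator level of 0. Under those assumptions the four components take the closed forms used in the code; the coefficient values are hypothetical, and no confounding is assumed.

```python
import math

# Hypothetical coefficients of the outcome and mediator models
theta0, theta1, theta2, theta3 = 1.0, 0.5, 0.4, 0.3  # Y = t0 + t1*X + t2*M + t3*X*M
beta0, beta1 = 0.2, 0.6                               # E[M | X] = b0 + b1*X

# Four components for exposure 0 -> 1 with reference mediator level m* = 0
cde = theta1              # controlled direct effect at M = 0
int_ref = theta3 * beta0  # reference interaction
int_med = theta3 * beta1  # mediated interaction
pie = theta2 * beta1      # pure indirect effect

# Aggregations described in the text
pai = int_ref + int_med   # portion attributable to interaction
pde = cde + int_ref       # pure direct effect
tie = pie + int_med       # total indirect effect

def expected_y(x):
    """E[Y | X = x] implied by the two linear models."""
    expected_m = beta0 + beta1 * x
    return theta0 + theta1 * x + theta2 * expected_m + theta3 * x * expected_m

te = expected_y(1) - expected_y(0)

print(f"CDE={cde:.2f} INTref={int_ref:.2f} INTmed={int_med:.2f} PIE={pie:.2f}")
print(f"PDE + TIE = {pde + tie:.2f}, CDE + PAI + PIE = {cde + pai + pie:.2f}, TE = {te:.2f}")
```

Both groupings, the one for mediation (PDE + TIE) and the one for interaction (CDE + PAI + PIE), add up to the same total effect, which is the sense in which the fourfold decomposition encompasses the two.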

#### *6.4.3 Wrapping Up*

Graph theory reveals that the identification of causal mechanisms requires counterfactuals. The natural indirect effect refers to the counterfactual state in which we abstain from the treatment, but the mediator is present. Contrasted with the state where both the treatment and the mediator are present, we can quantify how much of the effect of X on Y is captured by the mediator M, and how much of Y is owed to the mediator M alone. Such a natural indirect effect gauges a causal mechanism once the back-door criterion is satisfied, i.e., all back-door paths are closed.

The consequences of this definition are far-reaching. The identification of causal mechanisms appears as much out of reach to conventional mediation analysis as to randomization. What appears as bad news can also be a good insight, as the natural indirect effect yields a mediation formula stripped of any parametric assumptions. Under some assumptions, this formula allows quantifying the causal mechanism based on observational data. Section 6.5 demonstrates this claim with the example of a renowned identification debate.

#### **6.5 Applications**

#### *6.5.1 A Mechanistic View on the Worm Wars*

In this application, I add a causal mediation view to the "worm wars" – a famous debate over the interpretation of an influential cluster-randomized experiment in Kenya that, together with other studies, brought one of its authors, Michael Kremer, the Nobel Memorial Prize in Economic Sciences in 2019.

The study originates from the evidence that nearly two billion people worldwide – mostly children – are infected by intestinal worms. These species inhabit the human digestive tract; they spread by expelling their eggs via the body waste of infected people. Without good sanitation, these microscopic eggs can find their way, unnoticed, onto the skin or food of another person. Once someone ingests an egg, the reinfection cycle continues. Poor sanitation facilities and hygiene practices allow infections to spread locally.

In 2004, a landmark study showed that an inexpensive medication to treat parasitic worms could improve health and school attendance for millions of children in many developing countries (Miguel & Kremer, 2004). Eleven years later, a headline in *The Guardian* reported that the deworming treatment had been debunked. In 2021, a carefully executed replication study restated the original findings (see Ozier, 2021). Why so?

Miguel and Kremer convincingly argued that, due to the infectiousness of the worms, individual treatments are unlikely to be effective because treated children will quickly be re-infected by other children. Consequently, they ran an encompassing field experiment in Kenya using cluster randomization at the school level. The experiment compared more than 25,000 treated children across three waves to a control group for each wave with similar attributes except for the treatment. They found a remarkable effect of the treatment on school attendance not only in the treatment area (up to 3 km) but also in the surrounding areas (3–6 km from the treatment).

Replication analyses have mainly confirmed the direct effect in the treatment areas. However, the spillover effects became subject to debate and turned insignificant in some specifications (for example, Aiken et al., 2014). The debate about the replication involved many influential scholars, was covered by several blogs, and eventually came to be known as the "worm wars". A systematic review of the debate seemed to restore the trust in the key findings of the original study. Ozier (2021) concluded that, if anything, years of debates and replication have reinforced his belief in the main effect. In short, it appeared as if the treatment of Miguel and Kremer had indeed had a substantial positive impact on children's school attendance.

However, there is a second line of skepticism, concerned less with the significance levels of the total effects than with the plausibility of the indirect effect. The indirect effect, as we have learned, considers the probability of a positive outcome (school attendance) given that we do not have a treatment (no de-worming drug intake), but we set the mediator (being, in fact, de-wormed) to the value it would have taken under treatment (de-worming drug intake). We contrast this with the probability of a positive outcome (school attendance) under natural conditions where the treatment is given (de-worming drug intake) and the mediator too (being de-wormed). Based on Pearl's mediation formulae, we can compute the natural indirect effect using observational data. The results can be given a causal interpretation if we can exclude confounding between the mediator (being de-wormed) and the outcome (school attendance).
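As a minimal sketch of how such a computation runs, the following lines apply the mediation formula to a binary treatment X, a binary mediator M, and a binary outcome Y. The conditional probabilities are invented for illustration and are not the values of the study; the names `p_y`, `p_m`, and `mediation_formula` are hypothetical. The sketch assumes a randomized treatment and no confounding between mediator and outcome.

```python
# Hypothetical conditional probabilities, NOT taken from the deworming data.
# p_y[(x, m)] = P(Y=1 | X=x, M=m); p_m[x] = P(M=1 | X=x).
p_y = {(0, 0): 0.70, (0, 1): 0.72, (1, 0): 0.78, (1, 1): 0.82}
p_m = {0: 0.40, 1: 0.90}


def p_mediator(m, x):
    """P(M=m | X=x) for a binary mediator."""
    return p_m[x] if m == 1 else 1.0 - p_m[x]


def mediation_formula(p_y, p_m):
    """Natural direct, natural indirect, and total effect for binary X and M."""
    # NDE: switch the treatment on, keep the mediator distributed as under X=0.
    nde = sum((p_y[(1, m)] - p_y[(0, m)]) * p_mediator(m, 0) for m in (0, 1))
    # NIE: keep the treatment off, let the mediator behave as under X=1.
    nie = sum(p_y[(0, m)] * (p_mediator(m, 1) - p_mediator(m, 0)) for m in (0, 1))
    # TE: difference in outcome probability between treated and untreated.
    te = (sum(p_y[(1, m)] * p_mediator(m, 1) for m in (0, 1))
          - sum(p_y[(0, m)] * p_mediator(m, 0) for m in (0, 1)))
    return nde, nie, te


nde, nie, te = mediation_formula(p_y, p_m)
print(f"NDE = {nde:.3f}, NIE = {nie:.3f}, TE = {te:.3f}")
print(f"NIE / TE     = {nie / te:.1%}  (share sustained by the mediator alone)")
print(f"NDE / TE     = {nde / te:.1%}  (share not running through the mediator)")
print(f"1 - NDE / TE = {1 - nde / te:.1%}  (share owed to the mediator)")
```

When treatment and mediator interact, as in these invented numbers, NIE/TE and 1 − NDE/TE differ, so the three ratios need not add up to 100%.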

This mechanistic perspective on the study is of great interest for at least two reasons. First, experts in deworming cast considerable doubt on the findings. Epidemiologists refused to include the paper in a meta-study for methodological reasons (no blinded treatment was performed) and referred instead to existing epidemiological studies that, if at all, showed very modest effects of deworming on school attendance. In other words, the authors of a Cochrane review were unconvinced that de-worming could have had such a substantial effect as reported in Miguel and Kremer (Taylor-Robinson et al., 2015). Second, the authors of the original experiment framed their study and their results as if they had strong evidence for the entire mechanism. In the words of the authors' abstract, "[*d*]*eworming substantially improved health and school participation among untreated children in both treatment schools and neighboring schools, and these externalities are large enough to justify fully subsidizing treatment*." (Miguel & Kremer, 2004, 159). In short, the authors' inference is that their evidence points to a clear recommendation for subsidizing de-worming treatments because de-wormed students have a higher likelihood of attending school. Is it the de-worming via the drug intake that causes students to attend school more often?

Based on the original data, the mediation formulae can be used to put the mechanistic claim under scrutiny. Table 6.1 includes all probabilities required to compute the natural indirect, natural direct, and the total effect based on the replication data of Miguel and Kremer (2014) and Miguel et al. (2014).<sup>10</sup> By relating indirect and direct effect quantities to the total effect, we can draw valuable conclusions. The natural indirect effect supports the suspicion of the epidemiologists. Only 1.8% of the total effect would be achieved by worm-free students alone. In contrast, 94.2% of the total effect is related to the natural direct effect of the treatment other than deworming students. Finally, 5.8% of the effect on attendance is owed to the capacity of the treatment to deworm students.<sup>11</sup>

**Table 6.1** Probabilities of the treatment, the mechanism, the outcome and the natural direct (NDE), indirect (NIE), and total effect (NTE)

**Note:** Compare equations for NIE, NDE, and TE above.

<sup>10</sup>For the replication, I use a very simple model based on the drug treatment in the first period of the field experiment. The experiment had three waves, but the comparison groups changed during the waves and because the effect on school attendance is predominantly a result of the first wave, I focus on the first wave only. For the mediator, I use the reversed indicator of any moderate or heavy worm infection based on the WHO standard in 1999. I see the mechanism present when a treated student is indeed free of worms. For the outcome, I use a dummy of students being present in school at the time of the surprise visit. The current documentation of the data is exemplary (see Miguel and Kremer, 2014; Miguel et al., 2014; Hicks and Nekesa, 2014).

How do we make sense of these numbers?

Humphreys (2015) documented and commented on the worm wars in close detail, driven by concerns for the mechanistic element of the study. He points to several important aspects that can be learned from the documentation of the experiment. Based on background information and the skeptical comments of epidemiologists, we might add several pathways between treatment and outcome (see Fig. 6.9). The causal graph reveals that the estimate above of the natural indirect effect is not identified. There is nothing identified in this system of pathways because too many nodes are unobserved. Let us briefly describe the pathways in Fig. 6.9.

One element of the treatment is the drug intake that seems to effectively de-worm students. The effect of de-worming alone is relatively weak, as the path analysis in Table 6.1 confirms. The drug intake has at least two more effects on attendance that cannot be isolated given the existing data. De-wormed students create spillovers, and spillovers might feed back to the treated. This feedback is problematic because it undermines the assumption of the independence of the treatment group and the control group – the problem that compelled resorting to cluster randomization in the first place.

Beyond spillovers, the drug intake can create placebo effects. Students feel better because of the drug, irrespective of being de-wormed, which might increase school attendance. Since the control group was not treated with a placebo, we cannot estimate the placebo effect. More worrisome is how the research group treated the treatment group beyond the drug intake. The documentation files list health lectures, wall charts in the schools, training of teachers in the treatment schools, and encouragement of the treated students to wash their hands, wear shoes, and avoid fresh water (see Hicks & Nekesa, 2014, 7).<sup>12</sup> This extensive treatment had obvious health effects – including a contribution to de-worming – which suggests that the treated students likely became well aware of being subject to an encompassing treatment package. Thus, at least three more paths follow from that treatment beyond drug intake.

**Fig. 6.9** Mechanisms in the worm wars

<sup>11</sup>An alternative way of modeling these numbers would be to use ready-made packages in software such as R or Stata. In Stata, you would use the model builder and simply graph the mediation model. After the estimation of all path coefficients, the effects can be decomposed into total, direct, and indirect effects using the estat teffects postestimation command (see Bollen, 1989; Sobel, 1987). Note that this command still assumes linearity and leads to biased estimates in this case.

First, the educational elements on health issues might have affected the well-being of students besides de-worming, which raises their probability of being present in school. Second, being so obviously treated might activate the Hawthorne effect, the rising willingness of participants to make the experiment a success in light of the efforts experimenters provided for the treated. For example, teachers might just encourage students in the treatment group to show up because they know that school attendance is an important measure (although it has to be noted that school attendance was measured through surprise visits). Third, health education affects both the likelihood of being de-wormed, beyond the de-worming drug intake, and school attendance. Accordingly, the effect of being de-wormed on school attendance, including the spillover effects, is confounded. Knowing about the direction of the influence of health education (increasing de-worming and school attendance), the already weak indirect effect of de-worming via drug intake on school attendance is most likely biased upwards. This perspective reveals that the authors make strong mechanistic inferences without ever quantifying the importance of their hypothesized mechanism and without noticing that the indirect effect cannot be precisely identified, given the observable data at hand.

<sup>12</sup>The educational treatments at the school level were part of a separate intervention of the same NGO and could in principle be controlled based on the data (see Hicks & Nekesa, 2014, 5). In fact, Miguel and Kremer condition on those interventions. They write "None of these programs involved health treatments for pupils, and given the cross-cutting design, are unlikely to complicate the identification of average treatment effects across PSDP program and comparison schools." Nonetheless, in many specifications Miguel and Kremer (2004) control for assignment to assistance through these other programs. Only a page later, they write without considering any potential bias "[t]he educational component of the intervention focused on teaching children about avoiding the disease. Health educators explained the transmission vectors for different types of helminths [one of the relevant worm types] and also promoted hand-washing, wearing shoes, and avoiding contact with fresh water" (2014, 7).

Such a mechanistic perspective also reveals the standing of the main criticism of the epidemiologists. The Cochrane reviewers classified the study as very weak in terms of evidence, predominantly because of the lack of placebo treatment of the control group. Indeed, except for the spillover path, all alternative paths between treatment and outcome could have been closed by placebo treatment. The consideration also applies to the educational health elements.

Thus, the mechanistic view qualifies the inference of this landmark study substantially. First, there is a confirmation of a significant indirect effect running from the treatment over being de-wormed to higher school attendance. However, this indirect effect explains a very marginal part of the increased school attendance. Far more important are the indirect effects triggered by the entire treatment package beyond the ability to de-worm students. The rise in school attendance is predominantly a composite of different pathways, from the Hawthorne pathway over the health education pathway to a potential placebo pathway, which combined are around 54 times more powerful for school attendance than the de-worming effect. The overall inference to recommend the distribution of cheap drugs might be replaced by the recommendation to offer supposedly more expensive health education.

To be very clear about it, the study of Miguel and Kremer is comparatively well-executed and deserves to be praised for the logic of cluster randomization alone. Nonetheless, the mechanistic view on this experiment demonstrates that randomization does not allow for mechanistic inference. While the total effect of the treatment package might still be perfectly identified, the mechanistic view helps identify which elements of the treatment have created more or less powerful pathways to the outcome. It is extremely interesting to know how much Hawthorne, placebo, or health education contributed to the substantial rise in school attendance, as such effect decomposition can help to improve similar experiments in the future. As in the lemon-scurvy example, experimenters need to disable these alternative pathways (exclusion restriction) to arrive at the correct inference.

A mechanistic view may help to understand supposedly strong effects in well-executed experiments. Moreover, it can reveal causal mechanisms where experiments seem to yield nothing.

#### *6.5.2 A Mechanistic View on a Chicago School Reform*

In the late 1980s, the US secretary of education, William Bennett, called Chicago's public schools the worst of the nation. However, several reforms in the late 1990s moved them from the worst to 'innovators of the nation'.<sup>13</sup> One of the core reforms involved a program called 'Algebra for All', compulsory prep courses for ninth graders in high school. At first sight, the program seemed a success as math scores rose significantly. However, the qualification of incoming ninth-grade students was already improving due to changes in the K-8 curricula (an important confounder). Once this confounder was controlled for, the reform turned out to be insignificantly related to the math performance of ninth graders. Here, the story would have typically found its end.

Luckily, Professor Guanglei Hong remained curious because she knew that when Algebra for All was introduced, more than the curriculum changed. The lower-achieving students found themselves in classrooms with higher-achieving students and could not keep up. Detrimental effects for students and teachers caused by mixed classes compared to remedial classes are well-known. In short, Hong was suspicious of the unanticipated side effects of the treatment package. Testing the classroom environment as a mediator between reform and outcome clearly showed that this pathway had negative consequences. Once taken into consideration, the direct effect turned positive. The lesson seemed clear: removing the mixed classes and keeping the prep courses was the logical consequence and created a success story of the modified Algebra for All program.

Students in Chicago significantly benefited from a mechanistic view on an education program that has, at first sight, falsely been considered a failure. We learn from this example that different mechanisms can cancel each other out ("opposing mediation" as in Kenny [1998]), which demonstrates that even a null finding based on a randomized treatment can be worth closer scrutiny at the level of mechanisms. The Algebra for All example is similar to the discredited causal link between lemons and scurvy prevention, although its revitalization took place in a substantially shorter period.
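A few lines of arithmetic illustrate how opposing mediation can hide a beneficial direct effect. The sketch assumes linear, unconfounded paths, so that the indirect effect is the product of the two path coefficients; the coefficients are invented and do not come from the Chicago data.

```python
# Hypothetical standardized path coefficients, assuming linear, unconfounded paths.
direct = 0.30   # reform -> math score, holding classroom composition fixed
a = 0.60        # reform -> share of mixed-ability classrooms
b = -0.50       # mixed-ability classrooms -> math score

indirect = a * b            # effect transmitted through the classroom pathway
total = direct + indirect   # what a comparison of reform vs. no reform picks up

print(f"direct = {direct:+.2f}, indirect = {indirect:+.2f}, total = {total:+.2f}")
```

With these numbers, the negative classroom pathway exactly offsets the positive direct effect, so an analysis of the total effect alone would report a null finding.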

#### **6.6 Thou Shall Not Raise Causal Illusions**

Scholars of pathways have revolutionized our view on causal identification. The counterfactual perspective on pathways reveals that fundamental problems of causality – asymmetry and confounding – can logically be solved by closing either the back- or the front-door. This perspective embraces conventional counterfactual causal inference such as randomization or quasi-experiments. Causal graphs help to make its logic and assumptions very transparent. Applying the logic of the back-door to generally defined causal mechanisms reveals two things. First, conventional approaches are ill-suited for identifying causal mechanisms as they can mistake their structure. Pathway analysis solved that issue by focussing on indirect effects. This perspective reveals that causal mechanisms can be quantified by nonparametric comparisons of observable with counterfactual probabilities. Lending these numbers a causal meaning depends on a simple assumption: path estimates in a system of pathways must be unconfounded.

<sup>13</sup>One of its inventors, Arne Duncan, became secretary of education under Barack Obama.

Unfortunately, this unconfoundedness cannot be fully ensured by randomization – although the randomization of the treatment helps a lot to block all paths running into the candidate cause. Moreover, causal mechanisms can only be identified if a theoretically exhaustive causal system is given and all confounders are observed and conditioned on. Based on a theoretically defined causal system, effective strategies of de-confounding can be determined. The complexity of the task becomes apparent when we remind ourselves of the problem of the collider bias. The collider bias is an instance of a single confounded path in a system of pathways, leading in the worst case to completely misleading estimates of the indirect and direct effects – such as when smoking mothers are understood to increase the survival rate of their children. Besides, complex pathways with sequences of many mediators can complicate the identification task and the chances for false inference multiply.
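How conditioning on a collider manufactures an association can be seen in a small simulation. The setup is a generic assumption rather than a model of any study in this chapter: two independent causes X and Y both feed into a common effect C, and the analysis is then restricted to cases with a high value of C. The seed, the sample size, and the threshold are arbitrary choices.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid external dependencies."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)          # fixed seed so that the run is reproducible
n = 20000
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]          # drawn independently of x
collider = [xi + yi for xi, yi in zip(x, y)]     # common effect of x and y

# Conditioning on the collider: keep only the cases with a high value of it.
selected = [(xi, yi) for xi, yi, ci in zip(x, y, collider) if ci > 1.0]

corr_all = pearson(x, y)
corr_selected = pearson([s[0] for s in selected], [s[1] for s in selected])
print(f"correlation in the full sample:     {corr_all:+.3f}")
print(f"correlation after selecting on C:   {corr_selected:+.3f}")
```

In the full sample the two causes are essentially uncorrelated, whereas within the selected cases a clearly negative correlation appears although neither variable affects the other.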

The pathways perspective on the identification of causal mechanisms is logically simple. However, mechanisms can only be identified given a theoretically exhaustive causal system where all the variables required to close the back-doors are measured, free of error, and conditioned on. Empirically, these assumptions are hard to meet. Thus, research relying on pathways or causal mechanisms should avoid creating the causal illusion that the back-door criterion will easily tackle identification tasks.

The greater strength of the pathway approach is not to deliver a ready-made tool for causal inference but a perspective that can boost the transparency over what is needed to identify a mechanism causally. It complements standard approaches of causal inference that typically seek to identify total effects. Analyses of mechanisms searching for indirect effects ask a deeper form of why. Preliminary answers to these deeper questions can at times be very generic, such as a single mediator connecting cause and outcome, and at times span very complex systems of pathways. However, even the most generic mechanism can reveal a great deal. Think of lemons' ability to prevent scurvy, of smoking mothers decreasing the survival rate of their children, of the capacity of de-worming to increase school attendance, or of preparation courses improving school performance. In all examples of this chapter, evidence on a single mediator considerably qualified the inference of a cause–effect relationship.

Despite the capacity of a mechanistic view to qualify the inference of even well-executed experiments, the added values are complementary. Randomized treatments facilitate the identification of causal mechanisms because important sources of confounding are erased by design. Mechanisms, in turn, improve the execution of, and the inference from, well-executed experiments too. The more we know about the mechanisms, the better we can identify total effects.

#### **Suggested Readings**

There are three books of great help to understand causal mediation. The most encompassing work on causal mediation analysis, including moderated mediation, is most likely VanderWeele's book *Explanation in causal inference: methods for mediation and interaction*, published in 2015 by Oxford University Press. Although probably the most encompassing, it addresses the issue from the perspective of biostatistics. Easier access to causal mediation is provided by Chapter 9 on *Mediation: The search for a mechanism* in Pearl and Mackenzie (2018), published by Basic Books. The entire textbook can be highly recommended to cast light on recent developments in causal identification against the background of the history of statistics. Finally, Chapter 10 on *Mechanisms and causal explanation* in Morgan and Winship (2015) lies somewhere in between VanderWeele's equation-based insights and Pearl and Mackenzie's captivating narrative. Their entire book on *Counterfactuals and causal inference* can be recommended, as it covers virtually all causal identification tasks from the perspective of the social sciences while preserving a deep commitment to graph theory and counterfactual thinking.

#### **Helpful Websites**

Beyond books, there are two highly informative websites on causal mediation. The one by David Kenny provides regular updates on mediation analysis and also covers issues in causal mediation (http://davidakenny.net/cm/mediate.htm). Alternatively, Columbia University provides information on causal mediation, including a recorded lecture of VanderWeele based on the Harvard Seminar Series in Biostatistics (https://www.publichealth.columbia.edu/research/populationhealth-methods/causal-mediation#websites).

#### **Software Recommendations**

Causal mediation, the identification of mechanisms, and causal pathway analysis are relatively new fields characterized by rapid development. Formulas, methods, and software applications change accordingly. Nonetheless, several software packages have proven extremely useful.

R (*mediation* package):

	- the *mediate()* function estimates the natural direct and indirect effects based on Pearl's mediation formula,
	- a test for X-M interaction can be conducted with the *test.TMint()* function (a significant finding implies that the assumption of no X-M interaction does not hold),
	- the sensitivity analysis function *medsens()* allows investigators to examine, through simulations, the robustness of their findings to potential unmeasured M-Y confounders.

Results for all analyses are displayed using the *summary()* and *plot()* functions.

SAS:

	- The SAS macro is a regression-based approach to estimating controlled direct and natural direct and indirect effects.
	- The macro can handle virtually every distributional and link assumption (compare Valeri et al., 2013).

Stata:

	- *paramed* package (no sensitivity analysis) (Emsley et al., 2013).
	- *ldecomp* (no sensitivity analysis) (Buis, 2010).
	- *medeff* (sensitivity analysis) (Hicks and Tingley, 2011).
	- *gformula* (helpful in case of post-treatment and time-varying confounding) (Daniel et al., 2011).

#### **Review Questions**


#### **References**




**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 7 Testing Joint Sufficiency Twice: Explanatory Qualitative Comparative Analysis**

**Alessia Damonte**

**Abstract** Standard Qualitative Comparative Analysis (QCA) applies an eliminative cross-case algorithm to identify which combinations of factors are logically associated with an outcome in a population. As such, it suits the purpose of pinpointing the conditions under which an outcome occurs or fails. However, the explanatory import of its findings only follows if the algorithm identifies theoretically *interpretable*, logically *valid*, and empirically *plausible* causal compounds.

The chapter provides an essential guide to designing an explanatory QCA that meets the three credibility requirements at once. Section 7.2 addresses how to develop starting hypotheses consistent with the assumptions of complex causation to preserve theoretical interpretability. Section 7.3 introduces the Boolean algebra required to model a hypothesis and find which part supports the explanatory claim in the cases at hand. Section 7.4 addresses the issue of gauging conditions to ensure the empirical plausibility of the analysis. Last, Sect. 7.5 summarizes the protocol, illustrated by the replicable example in the online R file.

#### **Learning Objectives**

After studying this chapter, you will be able to:


University of Milan, Milan, Italy e-mail: alessia.damonte@unimi.it

A. Damonte (\*)

#### **7.1 Introduction**

Qualitative Comparative Analysis (QCA: Ragin, 1987/2014, 2000, 2008; Duşa, 2019; Oana et al., 2021; Mello, 2021) stands out amid the suite of causal techniques for three main reasons that raise as many questions.

First, QCA moves from the default assumption that causation lies in compounds or teams of conditions. Its solutions entail that things happen when all the "right" conditions are given together, like in a chemical reaction (Mackie, 1965, 1974; Cartwright & Hardie, 2012). The first question of explanatory QCA asks how to ensure that results are interpretable "recipes" for the outcome.

Second, QCA originally revolves around a pruning algorithm. It compares configurations that meet regularity requirements of association with an outcome to drop irrelevant conditions, along the lines of a most-dissimilar case design (e.g., De Meur & Berg-Schlosser, 1994), albeit run twice. The second question asks how the technique can be geared toward pinpointing valid causal compounds despite the shortcomings of such a design (e.g., Geddes, 1990; Most & Starr, 2015; Krogslund et al., 2015).

Third, QCA's solutions hold at the levels of both the population and individual cases. Such a peculiarity is based on gauging operations that preserve quantitative and qualitative information. These operations are an integral part of the analysis and bind findings to analytic units. The third question asks how these operations affect the tenability of solutions.

These three questions are addressed in Sects. 7.2, 7.3, and 7.4, respectively. Section 7.5 summarizes the protocol illustrated by the online R file.
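To give a first intuition of the pruning idea behind the second question, the following sketch performs a single pairwise reduction: two configurations that display the same outcome and differ in exactly one condition are merged, and the differing condition is dropped as irrelevant. It is a toy illustration with hypothetical conditions A, B, and C and hypothetical function names, not the procedure of any QCA package, which handles consistency thresholds, logical remainders, and repeated rounds of reduction.

```python
from itertools import combinations

CONDITIONS = ("A", "B", "C")

# Hypothetical configurations (1 = present, 0 = absent), all showing the outcome.
positive_rows = [(1, 1, 1), (1, 1, 0)]


def merge(row1, row2):
    """Merge two rows that differ in exactly one condition; return None otherwise."""
    differing = [i for i, (u, v) in enumerate(zip(row1, row2)) if u != v]
    if len(differing) != 1:
        return None
    merged = list(row1)
    merged[differing[0]] = "-"      # "-" marks the condition as irrelevant
    return tuple(merged)


def label(row):
    """Write a row as a conjunction, with ~ for absence and * for logical AND."""
    parts = []
    for name, value in zip(CONDITIONS, row):
        if value == "-":
            continue
        parts.append(name if value == 1 else "~" + name)
    return "*".join(parts)


reduced = []
for row1, row2 in combinations(positive_rows, 2):
    candidate = merge(row1, row2)
    if candidate is not None:
        reduced.append(candidate)

print([label(row) for row in positive_rows], "->", [label(row) for row in reduced])
```

Here A*B*C and A*B*~C reduce to A*B: since the outcome occurs with and without C, C is pruned from the compound.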

#### **7.2 Interpretability**

The recognized hallmark of QCA lies in its assumptions that causation is an asymmetric, conjunctural, and equifinal phenomenon (Ragin, 2008; see also Rosenberg et al., 2017). *Asymmetric* means that causation has a direction and proceeds from "causes" to "effects" as a relationship of dependence or conditionality ahead of temporal considerations. *Conjunctural* refers to the first reason for asymmetry: the actual cause is a compound and consists of a team, bundle, or package of contributing factors. *Equifinal* recalls the second reason for asymmetry: different compounds can yield the same outcome. These assumptions chime with mechanistic considerations on the ultimate shape of causation (e.g., Befani, 2013; Mahoney, 2021; Chap. 2).

#### *7.2.1 Mechanisms and Machines*

QCA assumes that the factors responsible for an outcome are many and related to each other as the constituting parts are to their whole. Moreover, it allows factors to have substitutes without loss of effectiveness for the causal compound (Mackie, 1966; Cheng, 1997; Cartwright & Hardie, 2012).

The textbook illustration of such a parts-to-whole relationship offers heat, oxygen, fuel, and defective or no sprinklers as the compound accounting for fire. These circumstances provide the complete set of relevant conditions under which the process of combustion must initiate (Salmon, 2020). Thus, they form a causal team based on the process that they explain.

The process also clarifies the general relationship between components, teams, and outcomes. In the textbook example, combustion results in a fire when the whole team of circumstances is given in the same place and in the right state: present heat, fuel, and oxygen; absent or defective sprinklers. The surefire or *sufficient cause* of the outcome is the right bundle. However, the right circumstances can take many actual shapes. For instance, a lightning bolt, a short circuit, or a lit match can all be equivalent sources of heat. Any actual bundle, then, is *unnecessary* as such. Besides, the process fails when any circumstance is given in the wrong state: poor oxygen, no fuel, or no heat all prevent combustion, while a working sprinkler system suffocates it. Any element of the compound, then, is a counterfactually vital (and hence *necessary*) component of the team, despite it alone being insufficient to yield the outcome. The elements of the compound are "partial causes" or "*inus* conditions", *inus* being the acronym of the *Insufficient* but *Necessary* part of an *Unnecessary* but *Sufficient* team.
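The logic of the fire example can be checked mechanically with a small truth-table sketch. It encodes the team as a Boolean conjunction and the alternative sources of heat as a disjunction; the function and variable names are hypothetical and serve the illustration only.

```python
from itertools import product


def heat_present(lightning, short_circuit, lit_match):
    """Any one source of heat is enough: the sources are substitutes."""
    return lightning or short_circuit or lit_match


def fire(heat, oxygen, fuel, working_sprinklers):
    """The team: heat AND oxygen AND fuel AND NOT working sprinklers."""
    return heat and oxygen and fuel and not working_sprinklers


RIGHT_STATE = (True, True, True, False)

# Sufficiency: among all 16 possible states, only the right team yields fire.
sufficient_rows = [row for row in product([False, True], repeat=4) if fire(*row)]

# Necessity within the team: flipping any single component stops the fire.
fire_after_flip = []
for i in range(4):
    altered = list(RIGHT_STATE)
    altered[i] = not altered[i]
    fire_after_flip.append(fire(*altered))

# Insufficiency: one component in the right state, all others in the wrong one.
wrong_state = [not value for value in RIGHT_STATE]
fire_from_single = []
for i in range(4):
    altered = list(wrong_state)
    altered[i] = RIGHT_STATE[i]
    fire_from_single.append(fire(*altered))

# Unnecessary bundle: two different actual bundles produce the same outcome.
fire_by_lightning = fire(heat_present(True, False, False), True, True, False)
fire_by_match = fire(heat_present(False, False, True), True, True, False)

print(sufficient_rows)
print(fire_after_flip, fire_from_single)
print(fire_by_lightning, fire_by_match)
```

Each component turns out to be insufficient alone yet necessary within the team, while the team as a whole is sufficient, and no particular actual bundle is necessary because the heat source can be substituted.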

Bundles of *inus* conditions seldom capture a generative process directly (see Chaps. 8, 9, and 10). Instead, they can capture the set of right circumstances as "nomological machines", that is, as "sufficiently stable" arrangements of triggering, enabling, sustaining, and shielding conditions underlying the generative process (Cartwright, 1999: 49, 2017). A nomological machine is such that its components together make other factors irrelevant to the same type of outcome across time and space. Therefore, a nomological machine is the specified explanation of a regular behavior independent of the remaining context (Craver & Kaplan, 2020). Moreover, it provides the theoretical construct that affords counterfactual evidence about the contribution of single components across cases.

#### *7.2.2 Operationalizing Typological Theories*

Typological theories provide a renowned starting point for developing configurational explanations (e.g., Elman, 2005). Such theories prove especially fruitful as they enable modeling of the alternative causal bundles as different settings of the same factors.

Some theories are consistent "explications" of a driving concept. For instance, Pahl-Wostl (2008) takes "regimes" as the driving concept. She defines water management regimes as the alignment of governance style, type of sectoral integration, scale of analysis and operation, information management, plus finance and risk management. Huntjens et al. (2011) operationalize the setting of these structural dimensions for two polar types of regimes, the "market-based" and the "integrated adaptive," then run a QCA to establish the features that account for the diversity in the policy-learning capacity of water management systems when faced with climate change challenges. In a similar vein, Colby (1991) builds on the concept of "policy paradigms." He stipulates that the compatibility of environmental and economic policy goals depends on the alignment of policy ideas and policy tools. Thus, "frontier economics" and "deep ecology" establish the trade-off between economic growth and environmental preservation, while "environmental protection," "resource management," and "eco-development" make room for their coexistence and integration. Damonte (2013) operationalizes these alternative paradigms as different settings of the same bundle of policy tools and identifies the configurations that account for the green decoupling of economic growth from pollution.

Other configurational hypotheses integrate heterogeneous streams of literature into a consistent explanatory whole. For instance, Sabatier and Mazmanian (1980) reason that the many accounts of the success and failure of policy implementation can be reduced to the consistent interplay of three dimensions: problem tractability, administrative effectiveness, and political support. Hinterleintner et al. (2016) operationalize the components of each dimension and run a QCA that explains the differences in the IMF's evaluation of austerity programs as differences in the credibility of national implementations. Theoretical integration can also be purposefully operated within the study. As an example, Lauri et al. (2020) integrate theories linking the defamiliarization of care work and gender equality with theories on the gender division of labor as embedded in different types of welfare systems. On this basis, they provide a thorough operationalization of childcare policies as bundles of tools that enforce different gender norms. QCA is applied to identify which tools, linked to the norms of which type of welfare system, yield high gender equality and which endanger the goal instead.

#### *7.2.3 Assembling Configurational Hypotheses*

A configurational hypothesis can also be crafted after a reasoned selection and integration of statistical "determinants." Surveys of scholars' practices (Amenta & Poulsen, 1994; Berg-Schlosser & De Meur, 2009) pinpointed four selection strategies. The "comprehensive approach" includes all the factors from all the relevant theories; the "perspective approach" selects single variables that represent major theories; the "significance approach" only focuses on statistically significant variables; the "second look" approach mixes statistically significant variables with theoretically meaningful factors that did not survive those same tests.

However, none of these strategies is proven to yield proper configurational hypotheses unless the selected factors can be related to the unfolding of a generative process as actors' constraints and opportunities. For instance, Stiller (2017) explains governments' success in adopting major welfare reforms as the interplay of policymakers' strategies (identified in ideational leadership, concession making, and blame avoidance) with key background features that make these strategies adequate, namely, the stage of the election cycle and the government's position toward the national welfare system. Similarly, Ansell et al. (2020) account for stakeholders' participation in collaborative governance as the result of motivations (that is, perceived incentives, interdependence, trust, and purpose) and governance's support of motivations through leadership services, opportunities to build relationships, and structures for pooling information.

A configurational hypothesis may also follow from problematizing correlational theories. Kogut and Ragin (2006) focus on the theory linking high economic development, thriving financial markets, and common law institutions. The configurational hypothesis develops from the consideration that the causal chain is underspecified. National economies, they reason, may still thrive despite poor financial markets if legality is ensured. Moreover, the effectiveness of common law institutions beyond their original contexts depends on their interplay with existing legal traditions. Thus, they run two QCAs that employ common law, features of the institutional "transplant," and commitment to the rule of law to account for differences in GDP per capita and, separately, in the size of the domestic financial markets, to check whether the two explanations overlap.

In short, the fundamental criterion for selecting an interpretable candidate *inus* factor is functional. It consists of whether one can develop *directional expectations* about the factor's contribution to the setting that compels and protects some causal process of interest. The expectation should support the claim that, were the factor given in the right state and in the right team, the process to the outcome would certainly follow. As we will see in Sect. 7.3.2, these directional expectations play a crucial role in the analysis as they establish the plausibility of counterfactual assumptions.

#### **7.3 Validity**

The validity of inferences about *inus* hypotheses depends on the algebra deployed to make them testable. Such a suitable algebra should allow factors to


Boolean algebras can easily render these states and relationships. Introduced as primary devices to analyze human reasoning about the world (De Morgan, 1847; Boole, 1853), their structures support a twofold reading (Stone, 1936)—logical, and set-theoretical.

#### *7.3.1 QCA's Algebra*

Like any other, QCA's algebra is a language of literals and operators suitable to render complex relationships according to fundamental rules.

#### **7.3.1.1 Literals**

Boolean algebras use "literal symbols" to indicate factors as attributes or states of a unit of observation. A literal stands for a name or an adjective denoting "either a thing or some quality or circumstance belonging to it" (Boole, 1853: 27). QCA borrows the convention and indicates a state with an uppercase letter. Thus, *A* reads ⌜*A* present⌝ or ⌜*A* positive⌝ or the predicate ⌜is *A*⌝. The literal provides an empty placeholder for whatever attribute we consider as the candidate *inus* condition, such as "inflammable" referred to a material; "hierarchical" to a governance structure; "affluent" to a society; "independent" to a voter.

Once defined, a literal establishes the similarity of any units of observation *u<sub>i</sub>* to which it applies. In Boole's original proposal, and in all the basic operations of QCA, such a recognition raises a class, that is, an *idempotent* collection of units. Idempotency means that, in contrast to probabilistic samples, classes satisfy the logical rule dubbed *dictum de omni*: what can be said of the whole also holds for each of its parts. Boole renders idempotency as in Eq. (7.1):

$$A^2 \coloneqq A \tag{7.1}$$

where ≔ indicates a stipulation and reads ⌜is by definition equal to⌝. As the only two numerical values that satisfy the stipulation are 1 and 0, Boole's literals can only take these two values, and the basic operations in QCA share this bivalent assumption, too.
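The bivalence claim can be checked with a short derivation. As a sketch, it treats *A* as an ordinary real number, which is an assumption made here for illustration only:

$$A^2 = A \iff A^2 - A = 0 \iff A\,(A - 1) = 0 \iff A \in \{0, 1\}$$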

These values convey two separate readings of the relationship between a unit and a literal:


The logical understanding captures the literal as the *intension* or quality of a unit. In contrast, the set-theoretical understanding captures the literal as the *extension* of the quality across the units in a universe. Operationally, the intension is decided by gauging rules, for instance, by defining which manifestations and intensity make it true that a unit ⌜is *A*⌝. Extension, on the other hand, is decided by counting, for instance, the number of units in the universe that ⌜are *A*⌝, which corresponds to the *cardinality* of class *A*. In bivalent Boolean algebra, the two readings overlap, making logical inferences especially straightforward.

#### **7.3.1.2 Operators**

The Boolean operators relevant to *inus* hypotheses correspond to the logical connectives ⌜ *not*⌝ , ⌜ *and*⌝ , ⌜ *or*⌝ , ⌜ *only if*⌝ , ⌜ *if*⌝ and the set-theoretical relationships of *difference*, *intersection*, *union*, and *superset/subset*.

#### Negation

The connective ⌜*not*⌝ denies the literal. The Boolean notation renders it with a bar above the uppercase literal to which it applies; in QCA, also common is the tilde before the uppercase literal, or the use of the lowercase literal. Thus, $\overline{A}$, ~*A*, and *a* all read ⌜is not-*A*⌝.

The logical negation transforms a unit's truth value into its opposite, calculated as in Eq. (7.2). The set-theoretical reading establishes that the negation of a set is the collection of units that are excluded from that set. Therefore, the negated set $\overline{A}$ corresponds to the difference (indicated by the backslash \\) between the universe $\mathbb{U}$ and set *A*, as in Eq. (7.3):

$$\overline{A}_i \coloneqq 1 - A_i \tag{7.2}$$

$$\overline{A} \coloneqq \mathbb{U} \backslash A \tag{7.3}$$

Equations (7.2) and (7.3) indicate that, by definition, a literal and its negation are mutual *complements*. The enforcement of this definition depends on gauging operations, an issue addressed in Sect. 7.4.

#### Joint Occurrence

These correspond to bundles of literals connected by the ⌜ *and*⌝ operator. In logic, the operator is a wedge (∧); in set theory, it is a cap (∩). In QCA, the operator is a dot (⦁) or a star (∗) although the connecting symbol may be omitted.

Two implications are worth noting. Permutation and grouping are irrelevant to ⌜*and*⌝ bundles: *ABC* means the same as *ACB* and *A*(*BC*), as the resulting class clusters the same units. In short, the Boolean ⌜*and*⌝ supports the commutative and the associative rule. Therefore, bundles are blind to the time dimension of sequences; instead, they emphasize the joint occurrence or interaction of attributes in a unit.

Logically, the ⌜*and*⌝ operator raises a *conjunction*. The underlying rule establishes a conjunction as true when each of its conjuncts is true. The rule is also known as "*the weakest link*": the conjunct with the lowest truth value defines the truth value of the compound.

Applied to a single predicate and its negation, the rule renders the logical *principle of non-contradiction*. As summarized by Eq. (7.4), the principle states that a predicate and its negation cannot be true of the same unit at the same time in the same sense. Set-theoretically, the principle is met when the intersection of a set and its negation is empty (∅), as in Eq. (7.5). The principle offers the first criterion of validity: it commits to rejecting inferences that build on, or lead to, *contradictions*.

$$A_i \wedge \overline{A}_i \coloneqq 0 \tag{7.4}$$

$$A \cap \overline{A} \coloneqq \varnothing \tag{7.5}$$

More generally, the weakest link of the *i*-th unit can be calculated as the minimum of its truth values in any of the 1 ≤ *j* ≤ *K* conjuncts, as in Eq. (7.6):

$$\bigwedge A_{ij} = \min \left( A_{i1}, \ldots, A_{iK} \right) \tag{7.6}$$

Therefore, in a universe of *N* units, the cardinality of the intersection of the *K* literals of interest corresponds to the sum of the 1 ≤ *i* ≤ *N* units' weakest links, as in (7.7):

$$\bigcap A_{j} = \sum_{i=1}^{N} \min\left(A_{i1}, \dots, A_{iK}\right) \tag{7.7}$$
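The weakest link rule and the cardinality of an intersection can be sketched in a few lines of Python. The function names and the small universe of units below are hypothetical and serve only to illustrate Eqs. (7.6) and (7.7) on bivalent scores.

```python
def weakest_link(*conjuncts):
    """Truth value of a conjunction: the minimum of its conjuncts (Eq. 7.6)."""
    return min(conjuncts)


def intersection_size(units):
    """Cardinality of the intersection: sum of the units' weakest links (Eq. 7.7)."""
    return sum(weakest_link(*unit) for unit in units)


# Hypothetical universe: one row per unit, one column per bivalent literal (A, B, C).
units = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 1)]

print([weakest_link(*u) for u in units])  # [1, 0, 0, 1]
print(intersection_size(units))           # 2

# Non-contradiction (Eq. 7.4): a literal and its negation never hold together.
print(all(weakest_link(a, 1 - a) == 0 for a in (0, 1)))  # True
```

Only the first and the last unit display every conjunct in the right state, so they are the only members of the intersection.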

#### Alternatives

These arise when literals are connected by the operator ⌜ *or*⌝ . In QCA, the operator is a plus symbol (+) and never omitted. Logic indicates it with a vee (∨); set theory with a cup (∪). Class idempotency makes permutation and grouping irrelevant to alternatives, too.

Logically, the ⌜*or*⌝ operator raises a *disjunction*. The underlying rule establishes the disjunction as true when at least one of its disjuncts is true. The rule can be dubbed "*the strongest link*": the disjunct with the highest truth value defines the truth value of the whole compound.

Applied to a single predicate and its negation, the rule renders the logical *principle of the excluded middle*. As summarized by Eq. (7.8), the principle states that, necessarily, either a predicate or its negation is true in a unit, so that the disjunction of the two raises a non-informative tautology. Set-theoretically, the principle is met when the union of the set and its negation returns the universe, as in Eq. (7.9).

$$A_i \lor \overline{A}_i \coloneqq 1 \tag{7.8}$$

$$A \cup \overline{A} \coloneqq \mathbb{U} \tag{7.9}$$

More generally, the strongest link of the *i*-th unit can be calculated as the maximum of the truth values of any of the 1 ≤ *j* ≤ *K* disjuncts, as in (7.10):

$$\bigvee A_{ij} = \max \left( A_{i1}, \dots, A_{iK} \right) \tag{7.10}$$

Therefore, in a universe of *N* units, the cardinality of the union of the *K* literals of interest corresponds to the sum of the 1 ≤ *i* ≤ *N* units' strongest links, as in (7.11):

$$\bigcup A_{j} = \sum_{i=1}^{N} \max \left( A_{i1}, \dots, A_{iK} \right) \tag{7.11}$$
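The strongest link rule mirrors the weakest link one and can be sketched the same way. Again, the function names and the data are hypothetical; the sketch only illustrates Eqs. (7.10) and (7.11) on bivalent scores.

```python
def strongest_link(*disjuncts):
    """Truth value of a disjunction: the maximum of its disjuncts (Eq. 7.10)."""
    return max(disjuncts)


def union_size(units):
    """Cardinality of the union: sum of the units' strongest links (Eq. 7.11)."""
    return sum(strongest_link(*unit) for unit in units)


# Hypothetical universe: one row per unit, one column per bivalent literal (A, B, C).
units = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]

print([strongest_link(*u) for u in units])  # [1, 1, 0, 1]
print(union_size(units))                    # 3

# Excluded middle (Eq. 7.8): a literal or its negation always holds.
print(all(strongest_link(a, 1 - a) == 1 for a in (0, 1)))  # True
```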

#### Necessity and Sufficiency

The reliance of QCA on the assumptions of *inus* causation gives center stage to the concepts of necessity and sufficiency.

Mackie (1974) illustrates them with the different behavior of coin-operated vending machines. A "sufficiency machine" always drops a snack for a coin, and sometimes it drops one without apparent reason, too. A "necessity machine" never drops a snack without a coin, and sometimes the coin fails. Last, one and only one snack for each coin is the behavior of the perfect "necessity-and-sufficiency machine." These intuitions capture both set-theoretical and logical relationships between an observed input, or antecedent (the coin), and an observed output, or consequent (the snack), connected by an unobserved (but possibly observable) mechanism.

As for notation, QCA indicates necessity with an arrow running from the outcome to the cause and sufficiency with an arrow running from the cause to the outcome. Thus, *A* → *B* reads ⌜*A* is sufficient to *B*⌝; $\overline{A} \leftarrow \overline{B}$ reads ⌜not-*A* is necessary to not-*B*⌝.

Set-theoretically, the *necessity* of *A* to *B* corresponds to *A* being a *superset* of *B*, indicated as *B* ⊂ *A*. The relationship is satisfied when *all the B are also A*, although there can be instances of *A* in the universe that do not display *B*. This corresponds to the logical situation in which being *B implies* being *A* or, more compactly, ⌜*B*, only if *A*⌝. The hallmark of necessity is the impossibility of the outcome in the absence of the factor, as in (7.12). Set-theoretically, it means that the proof of the necessity of *A* to *B* in the universe comes from the empty intersection in (7.13).

$$\overline{A}_i \wedge B_i = 0 \tag{7.12}$$

$$\overline{A} \cap B = \varnothing \tag{7.13}$$

Set-theoretically, the *sufficiency* of *A* to *B* corresponds to *A* being a *subset* of *B*, indicated as *A* ⊂ *B*. The relationship is satisfied when *all the A are also B*. In short, sufficiency renders the intuition of *A* as the constant antecedent condition of *B*. Logically speaking, it corresponds to saying that, for any *u<sub>i</sub>*, ⌜*B*, if *A*⌝ without exceptions. The hallmark of sufficiency coincides with the impossibility that the outcome fails when the factor is present, summarized by requirement (7.14) and its set-theoretical translation (7.15):

$$\overline{B}_i \wedge A_i = 0 \tag{7.14}$$

$$\overline{B} \cap A = \varnothing \tag{7.15}$$
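The two impossibility requirements can be turned into simple checks. The sketch below is illustrative only: the function names and the coin-and-snack records are hypothetical, and the data are bivalent. Necessity is tested as the absence of units showing the outcome without the factor; sufficiency as the absence of units showing the factor without the outcome.

```python
def is_necessary(a, b):
    """A is necessary to B when no unit displays B without A."""
    return all(min(1 - ai, bi) == 0 for ai, bi in zip(a, b))


def is_sufficient(a, b):
    """A is sufficient to B when no unit displays A without B."""
    return all(min(ai, 1 - bi) == 0 for ai, bi in zip(a, b))


# Hypothetical records of four trials: was a coin inserted, was a snack dropped?
coin = [1, 1, 0, 0]
sufficiency_machine = [1, 1, 1, 0]  # always delivers for a coin, once without one
necessity_machine = [1, 0, 0, 0]    # never delivers without a coin, the coin fails once

print(is_sufficient(coin, sufficiency_machine), is_necessary(coin, sufficiency_machine))
# True False
print(is_sufficient(coin, necessity_machine), is_necessary(coin, necessity_machine))
# False True
```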

#### **7.3.1.3 Truth Tables**

Stipulations and rules construe valid logical inferences as the calculus of truth values, visualized with the aid of a *truth table*. These tables clarify the possibilities that the selected literals make available ahead of observation. Logic sees it as the exhaustive catalog of the combinations of the literals' truth-values (Wittgenstein, 1922). Probabilistic theories dub such a structure "*sample space*" and understand it as the list of the potential events from random trials (e.g., Clarke, 2020). In any case, this structure reports the maximum diversity that units can display given specific literals and gauges.

The truth table entails a fundamental sense-making operation (Quine, 1982); thus, in it, each combination of the literals' truth values can be dubbed a *primitive*. The number of primitives depends on the number of literals and truth values under consideration; *K* bivalent literals yield 2<sup>*K*</sup> unique primitives. In the remainder, a truth table will be indicated as *Ω* and its primitives as *ω*.

The shape of truth tables follows conventional rules. The primitives are listed as rows: *ω*1 displays all true literals; *ω*<sub>2<sup>*K*</sup></sub>, all false ones (cfr. Duşa, 2019). Each of the remaining columns in the classical truth table is for the *truth function* of a connective, i.e., the truth values that each primitive returns when the connective's rule is applied to the states of its literals.

Table 7.1 displays a truth table of two literals (*A*, *B*) and five operators to indicate as many relationships: respectively, of conjunction (*and*), disjunction (*or*), necessity (*only if*), sufficiency (*if*), plus necessity and sufficiency (*iff*).

The values in the truth functions of each operator indicate the type of units that will (1) and will not (0) be observed if the relationship holds in the universe of reference (Sprenger, 2011). These expectations inform the discourse on the threats to the validity of inferences that are currently addressed by either design (e.g., Chap. 3) or model (e.g., Chaps. 6 and 8, Sect. 7.3.2 below).

• The *and* truth function follows from the application of the weakest link rule as in Eqs. (7.6) and (7.7) and returns a single true point in correspondence with the matching primitive (*ω*1 in Table 7.1). Thus, evidence of a conjunction is only provided by the units displaying every conjunct in the right state.


**Table 7.1** Truth table of two literals and five operators

*Note*: ( \*) observing this primitive makes the statement of suffciency vacuously true


A further note is due about the starred value of *ω*3 in Table 7.1. The instances of this primitive do not contradict the claim of sufficiency after the principle that *ex falso quodlibet*, meaning that anything can follow in the units where the antecedent is missing or otherwise false. However, units of this type provide *vacuous* evidence about the relationship (e.g., Salmon, 2020), as they may


Although the exact meaning of a vacuous observation depends on the interpretability of the relationship of interest, it nevertheless makes the problem visible as a formal issue of validity.

• The *iff* relationship arises from the conjunction of the truth functions of necessity and of sufficiency. It indicates the identity of the two literals and the overlapping of the respective classes of units in the universe. Thus, the truth function has two false points. In Table 7.1, these correspond to *ω*2 and *ω*3. In short, evidence of any inconsistency in the covariation of the two states challenges the validity of the identity.
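The truth functions of the five operators can be generated mechanically. The sketch below is a hypothetical illustration, not a reproduction of Table 7.1: it assumes that primitives are ordered from all-true to all-false, that necessity is read as ⌜*B*, only if *A*⌝, and that sufficiency is read as ⌜*B*, if *A*⌝. The column names are made up for the example.

```python
from itertools import product


def truth_table():
    """List the primitives of two bivalent literals with five truth functions."""
    rows = []
    for a, b in product((1, 0), repeat=2):
        necessity = 1 - min(1 - a, b)    # false only when B holds without A
        sufficiency = 1 - min(a, 1 - b)  # false only when A holds without B
        rows.append({
            "A": a,
            "B": b,
            "and": min(a, b),
            "or": max(a, b),
            "only_if": necessity,
            "if": sufficiency,
            "iff": min(necessity, sufficiency),
        })
    return rows


for number, row in enumerate(truth_table(), start=1):
    print(f"w{number}", row)
```

Under these assumptions, the conjunction is true in the first primitive only, and the *iff* column is false in the second and third primitives.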

QCA does not deploy logic, truth tables, and truth functions normatively. Instead, it relies on them as modeling tools and heuristics for the analysis.

#### *7.3.2 Identifying Valid* **Inus** *Hypotheses*

Logic provides scaffolding and criteria to render an *inus* hypothesis first, then decide whether it is rightly specified to the universe under analysis.

#### **7.3.2.1 Rendering Hypotheses**

Logic renders an *inus* hypothesis as a theoretically meaningful yet unwarranted claim about the sufficiency of a conjunction of *K* conditions to the occurrence of the outcome *Y*, as in (7.16):

$$\bigcap_{j=1}^{K} A_j \to Y \tag{7.16}$$

The formula means that ⌜were it the case that these *K* conditions together make an *inus* machine, then the outcome should certainly occur in an ideal instance displaying them all in the right state, and fail otherwise⌝. For it to hold, the starting hypothesis should contain the sufficient bundle to the positive and the negative outcome, which may have different specifications. QCA acknowledges this fact and addresses the positive and the negative outcomes in separate analyses. Nevertheless, the two sets of findings are related as long as both follow from the same truth table in which primitives are exclusively assigned to one outcome, and no contradiction is detected.

The value of an explanatory QCA lies in identifying the *plausible* bundle beneath the success and failure of an outcome in the population of interest, to define the tenability of the starting hypothesis and its underlying theory. Its identification procedure addresses validity issues as the underspecification or the overspecification of the starting hypothesis.

#### **7.3.2.2 Tackling Underspecification**

QCA deploys truth tables as a diagnostic device for detecting underspecification. Therefore, QCA's truth tables are partially different from those of logic.

A QCA's truth table contains as many columns as *inus* conditions in the hypothesis, plus one for the outcome and at least three additional columns for as many parameters of fit. The truth value of the outcome is the last column to be filled, depending on the researcher's decisions about the parameters, as follows:

#### Decision 1: Frequency Cut-Off

This parameter establishes whether a primitive is observed or realized in the universe of reference based on the minimum number of its "best instances" (Ragin, 2008). A unit is the best instance of the primitive in which it gets a membership score higher than 0.5 according to the weakest link rule (7.6).

Units' classification yields two kinds of primitives: *observed* or *realized*, and *unobserved* or *unrealized*. The unrealized ones are also known as *logical remainders* and constitute a common occurrence. Although the ratio of units to conditions inevitably plays a role in raising them (Marx & Duşa, 2011), their number is relatively independent of the richness of the hypothesis or the size of the universe. Instead, the logical remainders expose the *limited diversity* of the units under analysis and serve as a source of counterfactual reasoning (Ragin, 2008; see below).

The researcher's decision regarding the frequency cut-off may also increase the number of unrealized primitives. Conventionally, one best instance is enough to declare a primitive realized albeit rare. However, the frequency cut-off can be raised if the numerosity of the population and the gauging strategy suggest a risk of errors in units' classification.
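The effect of the frequency cut-off can be sketched as follows. The function names and the membership scores are hypothetical. The sketch assumes that membership in a negated literal is one minus the score, so that a unit scores above 0.5 in exactly one primitive unless one of its scores equals 0.5, in which case it is left unclassified.

```python
from collections import Counter
from itertools import product


def best_primitive(scores):
    """Return the primitive in which a unit scores above 0.5, or None if ambiguous."""
    if any(s == 0.5 for s in scores):
        return None
    return tuple(int(s > 0.5) for s in scores)


def split_primitives(units, cutoff=1):
    """Separate realized primitives from logical remainders given a frequency cut-off."""
    k = len(units[0])
    counts = Counter(p for p in map(best_primitive, units) if p is not None)
    primitives = list(product((1, 0), repeat=k))
    realized = [p for p in primitives if counts[p] >= cutoff]
    remainders = [p for p in primitives if counts[p] < cutoff]
    return realized, remainders


# Hypothetical membership scores of four units in two conditions (A, B).
units = [(0.9, 0.8), (0.7, 0.6), (0.2, 0.9), (0.5, 0.3)]

print(split_primitives(units, cutoff=1))
# ([(1, 1), (0, 1)], [(1, 0), (0, 0)])
print(split_primitives(units, cutoff=2))
# ([(1, 1)], [(1, 0), (0, 1), (0, 0)])
```

Raising the cut-off from one to two best instances turns the rare primitive into a logical remainder.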

#### Decision 2: The Consistency Threshold

The second of the researcher's decisions on the truth table for a QCA concerns the assignment of the realized primitives to either the positive or the negative outcome. In Standard QCA, the decision mainly follows considerations on consistency.

In line with consolidated axiomatizations (Hájek, 2011), QCA captures the *consistency of the sufficiency* of each primitive to an outcome (*S.cons* for short, also known as *incl* for "inclusion": Ragin, 2008; Schneider & Wagemann, 2012; Duşa, 2019) as an extensional gauge that checks for empirical violations of the impossibility requirement in (7.15) through the ratio in Eq. (7.17):

$$S.cons_{\omega_* \to Y} = \frac{|\omega_* \cap Y|}{|\omega_*|} \tag{7.17}$$

The vertical bars indicate the size of a partition. The denominator of the ratio is for any antecedent of interest (otherwise understood as the number of trials) and here corresponds to the primitive of interest. The numerator is for the number of successful trials, that is, the intersection of the primitive with the outcome. When none of the *N* units under analysis qualifies as an instance of the inconsistent intersection *ω*\* ∩ $\overline{Y}$, the numerator equals the denominator, and the *S.cons* gets its highest value of 1.00, which supports the claim that *ω*\* is sufficient to *Y*. The lower the overlap, the lower the *S.cons* parameter and the credibility of the claim of sufficiency.
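The ratio can be computed directly from membership scores. In the sketch below, the function name and the scores are hypothetical; sizes are obtained as sums of weakest links, following Eq. (7.7), and the primitive is assumed to be realized so that the denominator is not zero.

```python
def s_cons(omega, y):
    """Consistency of sufficiency: size of the overlap over size of the primitive."""
    overlap = sum(min(w, yi) for w, yi in zip(omega, y))
    return overlap / sum(omega)


# Hypothetical bivalent memberships of four units.
outcome = [1, 1, 1, 0]
consistent_primitive = [1, 1, 0, 0]    # every instance displays the outcome
inconsistent_primitive = [1, 1, 0, 1]  # the last instance lacks the outcome

print(s_cons(consistent_primitive, outcome))              # 1.0
print(round(s_cons(inconsistent_primitive, outcome), 2))  # 0.67
```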

The detection of critical inconsistencies justifies the dismissal of the hypothesis in the current shape as incomplete or otherwise misspecified (e.g., Rihoux & De Meur, 2009; Rohlfing, 2020). The textbook illustration comes from a configurational model applying Lipset's socioeconomic theory of democratization to account for the breakdown of democracy in Europe between the two World Wars. The model yielded a straightforward truth table with a single remarkable contradiction: the German case displayed all the socioeconomic conditions for a thriving democracy, but it experienced a clear regime breakdown. The contradiction disappeared after adding institutional conditions of government stability to the model.

The researcher's decision concerns the value of the *S.cons* below which the inconsistency is severe enough to preclude the assignment of the primitive to the outcome. An established convention suggests setting it at 0.85, although the range of *S.cons* values in the table may justify a different choice. An additional criterion considers "natural gaps"—that is, steep falls in the ordered series of the primitives' *S.cons* values. These gaps may suggest setting the consistency threshold in between clusters of primitives.
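One way to look for a "natural gap" is to order the *S.cons* values and locate the steepest fall between neighbors. The sketch below is only one possible operationalization; the function name and the values are hypothetical, and the result is a suggestion for the researcher to weigh rather than a rule.

```python
def largest_gap(values):
    """Return the pair of neighboring values separated by the steepest fall."""
    ordered = sorted(values, reverse=True)
    falls = [
        (ordered[i] - ordered[i + 1], ordered[i], ordered[i + 1])
        for i in range(len(ordered) - 1)
    ]
    _, upper, lower = max(falls)
    return upper, lower


# Hypothetical S.cons values of five realized primitives.
s_cons_values = [0.62, 1.0, 0.91, 0.40, 0.95]

print(largest_gap(s_cons_values))  # (0.91, 0.62)
```

In this example, a threshold set anywhere between 0.62 and 0.91 would separate the two clusters of primitives.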

The primitives not assigned to *Y* cannot be automatically assigned to $\overline{Y}$. Instead, the consistency of each primitive has to be tested with both states of the outcome separately. Nevertheless, meaningful solutions can be expected when the realized primitives below the consistency cut-off to *Y* return high *S.cons* values to $\overline{Y}$. This suggests that the starting hypothesis can account for both the occurrence and the non-occurrence of the outcome consistently.

#### Decision 3: The Coverage Cut-Off

The least common and last of the possible researcher's decisions concerns the empirical import of the claim of sufficiency, that is, how relevant the primitive is to the set of instances of the outcome of interest. The related parameter, dubbed *coverage of sufficiency* (*S.cov* for short), is calculated as in (7.18):

$$S.cov_{\omega_* \to Y} = \frac{|\omega_* \cap Y|}{|Y|} \tag{7.18}$$

When all the instances of the outcome display the primitive *ω*\*, the numerator in (7.18) equals the denominator, and the parameter takes its highest value of 1.00, supporting the claim that the primitive accounts for any unit with the positive outcome. But the empirical relevance of a factor to an outcome is the extensional gauge of its necessity in the cases at hand. Hence, the *S.cov* of *ω*\* to *Y* gauges the *consistency of necessity* (*N.cons* for short) of the primitive to the outcome. Conversely, the *S.cons* of *ω*\* to *Y* gauges the empirical relevance of the primitive as a necessary compound to the outcome, and hence counts as the *N.cov* of *ω*\* to *Y*.

A primitive's *S.cov* value decreases with the increase in the evidence that the outcome can occur without the primitive. Coverage cut-offs may be established to ensure the analysis is based on sufficient primitives that also are empirically relevant. However, decisions driven by empirical relevance may prove unwise, as even rare primitives may contribute to specify the composition of *inus* machines.
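Coverage can be sketched in the same way as consistency, changing only the denominator. The function name and the scores below are hypothetical; sizes are again sums of weakest links, and the outcome is assumed to have at least one instance.

```python
def s_cov(omega, y):
    """Coverage of sufficiency: size of the overlap over size of the outcome."""
    overlap = sum(min(w, yi) for w, yi in zip(omega, y))
    return overlap / sum(y)


# Hypothetical bivalent memberships of four units.
outcome = [1, 1, 1, 0]
primitive = [1, 1, 0, 0]  # the third unit displays the outcome without the primitive

print(round(s_cov(primitive, outcome), 2))  # 0.67
print(s_cov(outcome, outcome))              # 1.0
```

The first value falls below 1.00 because one instance of the outcome occurs without the primitive.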

#### **7.3.2.3 Tackling Overspecification**

Overspecification depends on having included factors in the starting hypothesis that prove irrelevant to account for the units' diversity.

The issue arises as mistaking some features for an *inus* component entrenches solutions in very specific contexts and unnecessarily reduces their portability (e.g., Craver & Kaplan, 2020; Salmon, 2020; cfr. Álamos-Concha et al., 2021; Chap. 10).

The acknowledged sources of overspecification are twofold: irrelevant components, and trivial factors.

#### Irrelevant Components

Quine-McCluskey's *minimizations* provide the standard approach to irrelevant conditions (Ragin, 1987/2014, 2000, 2008). These minimizations identify irrelevant components in the single varying conjunct of two otherwise identical primitives. For instance, the primitives $ABCD$ and $ABC\overline{D}$ can be minimized if both display high *S.cons* values to the same outcome. The formal reason is that the two allow the factorization $ABC(D \cup \overline{D})$, where $D \cup \overline{D} \coloneqq \mathbb{U}$ by Eq. (7.9). The operation highlights that the *implicant ABC* is sufficient to *Y* regardless of *D*, which can be dismissed as not an *inus* factor.
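The single difference rule behind these minimizations can be sketched as a pairwise merge. The encoding is hypothetical: each primitive is a string of 1s and 0s, one character per condition, and a dash marks a condition that has been dropped. The sketch shows one merging step only, not the full Quine-McCluskey procedure.

```python
def merge(p, q):
    """Merge two terms that differ in exactly one position; return None otherwise."""
    differing = [i for i, (x, y) in enumerate(zip(p, q)) if x != y]
    if len(differing) != 1:
        return None
    i = differing[0]
    return p[:i] + "-" + p[i + 1:]


# Conditions in the order A, B, C, D.
print(merge("1111", "1110"))  # 111-  (D is dropped, ABC remains)
print(merge("1111", "1100"))  # None  (two differences, no minimization)
```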

The adjudication of the *inus* nature of single components may change depending on how minimizations deal with the logical remainders. The Standard Analysis affords three alternative *counterfactual assumptions*, each leading to "solutions" at different degrees of specification, as follows:

• *Conservative or complex solutions*. These are returned under the assumption that unrealized logical remainders would have proven ambiguous had they been realized. Hence, minimizations only operate on observed primitives. With high limited diversity, the solutions could be as rich as the disjunction of any realized primitive.

• *Parsimonious solutions*. A superset (and hence, more general in scope) of the conservative solutions, the parsimonious solutions are returned under the assumption that any logical remainder could prove sufficient if matching a realized primitive except for one literal.

The surviving factors are the *inus* components in the hypothesis that are essential to account for the difference between the instances of the successful outcome and the instance of the failed one.

However, parsimonious minimizations can yield gappy explanations. Like the treatment variable in the Potential Outcome Framework (see Chap. 3) or the mediators in Path Analysis (see Chap. 6), the solutions from the parsimonious minimization may capture a causal channel, but certainly dismiss the information about the covariates needed to account for the effect (Damonte, 2021b). The reason is that the parsimonious minimizations drop factors regardless of the plausibility of the logical remainders that they employ.

• *Intermediate or plausible solutions.* These are returned under the assumption that only those logical remainders qualifying as *easy counterfactuals* would have proven sufficient if realized.

To understand the difference between an easy and a hard counterfactual, imagine the following. At the outset, we include condition *A* in the starting hypothesis under theoretical and empirical reasons to assume that it is an *inus* factor. More specifically, we assume that the condition makes an unknown causal compound *Φ* sufficient to the outcome *Y* when given in a state, say *A*, while in the opposite state, say $\overline{A}$, it turns *Φ* into a failure machine. In short, we add *A* under the *directional expectations* that


where ⊂ indicates a subset.

After we build and populate the truth table, we find the primitive *ω*1 = $ABCD$ is observed with an *S.cons* of 1.00 to *Y*, while we do not observe (hence we star) the primitive *ω*9\* = $\overline{A}BCD$. According to the single difference rule, *ω*1 and *ω*9\* can be minimized to *BCD*. However, the minimization entails that *ω*9\* is consistent with *Y*, and hence that $\overline{A}\Phi$ would yield *Y* if observed. This goes against our directional expectation (*ii*) and makes a *hard* or *implausible counterfactual* of *ω*9\*.

Now imagine the primitive *ω*13 = *ĀB̄CD* is realized with an *S.cons* of 1.00 to *Y*, while the primitive *ω*5\* = *AB̄CD* is a logical remainder. Again, according to the single difference rule, *ω*13 and *ω*5\* can be minimized to *B̄CD*. The minimization entails that *ω*5\* is consistent with *Y* and that *AΦ* would yield the outcome if observed. This agrees with our directional expectation (*i*); hence, *ω*5\* qualifies as an *easy* or *plausible counterfactual*.
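The reasoning above can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the routine of any QCA package: primitives are encoded as strings of 1/0 over the conditions *A*, *B*, *C*, *D* (the encodings of the example rows are illustrative), and the helper names `minimize` and `classify_counterfactual` are hypothetical.

```python
from typing import Dict, Optional

CONDITIONS = "ABCD"  # assumed order of the literals in each primitive


def single_difference(p: str, q: str) -> Optional[int]:
    """Return the position of the only literal on which p and q differ, else None."""
    diffs = [k for k, (a, b) in enumerate(zip(p, q)) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def minimize(p: str, q: str) -> Optional[str]:
    """Apply the single difference rule; '-' marks the dropped literal."""
    k = single_difference(p, q)
    return None if k is None else p[:k] + "-" + p[k + 1:]


def classify_counterfactual(observed: str, remainder: str,
                            expectation: Dict[str, str]) -> str:
    """Classify a remainder that differs from an observed, consistent primitive.

    `expectation` maps a condition to the state ('1' or '0') that is expected
    to contribute to the outcome. The remainder is 'easy' when it displays the
    condition in the expected state, 'hard' otherwise.
    """
    k = single_difference(observed, remainder)
    if k is None:
        raise ValueError("the two primitives must differ in exactly one literal")
    condition = CONDITIONS[k]
    if condition not in expectation:
        return "no expectation"
    return "easy" if remainder[k] == expectation[condition] else "hard"


expectation = {"A": "1"}  # the presence of A is expected to contribute to Y

print(minimize("1111", "0111"))                              # BCD, with A dropped
print(classify_counterfactual("1111", "0111", expectation))  # remainder lacks A
print(classify_counterfactual("0011", "1011", expectation))  # remainder displays A
```

In the first pairing the remainder displays the condition in the state expected to fail, so using it for minimization contradicts the directional expectation; in the second, the remainder displays the expected state.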

Intermediate minimizations return solutions from observed primitives and easy counterfactuals only. The factors added to the parsimonious solution terms may not be essential to preserve the non-contradictoriness of the compounds. As they improve the sufficiency of the implicant, they offer a more complete account of why the outcome failed in specific units while succeeding in others (Ragin, 2008; Fiss et al., 2013; Duşa, 2019; Oana & Schneider, 2018; Damonte, 2021a; cfr. Baumgartner, 2015; Baumgartner & Thiem, 2020).

#### A Note on Ambiguity in Solutions

Regardless of the usage of the logical remainders, it has been emphasized that solutions in Standard QCA may encounter problems of ambiguity, as the same primitives to an outcome may yield different prime implicants. To illustrate, the primitives *ABC*, *ABC̄*, *AB̄C* can legitimately be minimized as *AB* ∪ *AB̄C* or *AC* ∪ *ABC̄*. The information is displayed in a *Prime Implicant Chart* that shows which prime implicant covers which primitive, as displayed in Table 7.2.

Originally, the PI Chart was devised to allow researchers to decide which implicants could be retained in solutions in light of their theoretical import. The practice has been deprecated, as cherry-picking implicants may build a confirmation bias into solutions (e.g., Baumgartner & Thiem, 2020; Baumgartner, 2015), and the current good practices require that alternative implicants are reported, too. Besides, the alternative minimizations may contain information of interest for discussion. For instance, in the example above, the two solutions indicate that *A* is always required (it can be an enabling condition) but, in the cases at hand, it obtains in team with *B* or *C*, which can play as triggering conditions. The richer implicants *AB̄C*, *ABC̄* add that the one trigger can compensate for the absence of the other. These two richer implicants are currently left implicit by the reporting conventions that reward lean solutions. Under these rules, privileged prime implicants are those terms that, together, maximize the coverage of primitives, as do *AB* and *AC* in Table 7.2. Indeed, the conclusion that the union *AB* ∪ *AC* obtains the outcome does justice to alternative minimizations while logically entailing the richer implicants. Still, the information in the PI Chart deserves some attention, for it may suggest more accurate causal interpretations.
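A prime implicant chart of this kind can be derived mechanically. The sketch below is a hedged illustration, assuming the three observed primitives are coded as 111, 110, and 101 over *A*, *B*, *C* and that no logical remainder is used; the function names are hypothetical, and lowercase letters stand for negated literals only as a display convention.

```python
from itertools import combinations


def merge(p, q):
    """Merge two terms that differ in exactly one non-dropped literal."""
    diffs = [k for k in range(len(p)) if p[k] != q[k]]
    if len(diffs) == 1 and "-" not in (p[diffs[0]], q[diffs[0]]):
        k = diffs[0]
        return p[:k] + "-" + p[k + 1:]
    return None


def prime_implicants(primitives):
    """Repeat pairwise merging until no term can be reduced any further."""
    terms, primes = set(primitives), set()
    while terms:
        merged, used = set(), set()
        for p, q in combinations(sorted(terms), 2):
            m = merge(p, q)
            if m is not None:
                merged.add(m)
                used.update((p, q))
        primes |= terms - used  # terms that could not be merged are prime
        terms = merged
    return sorted(primes)


def covers(implicant, primitive):
    """An implicant covers a primitive when all its retained literals match."""
    return all(i in ("-", p) for i, p in zip(implicant, primitive))


def label(term, names="ABC"):
    """Render a term; lowercase marks a negated literal, '-' is omitted."""
    return "".join(n if t == "1" else n.lower()
                   for n, t in zip(names, term) if t != "-")


primitives = ["111", "110", "101"]
chart = {label(pi): [label(p) for p in primitives if covers(pi, p)]
         for pi in prime_implicants(primitives)}
for implicant, covered in chart.items():
    print(implicant, "covers", covered)
```

Each row of the printed chart lists the primitives that one prime implicant covers, so overlaps in coverage (and therefore alternative solutions) become visible at a glance.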


**Table 7.2** Example of Prime Implicant Chart

#### Dealing with Trivial Factors

Trivial factors are degenerate necessary conditions, that is, limiting cases of supersets. These arise when all or almost all the units in the universe of reference make the same state of the condition true—in short, when their distribution is skewed or constant.

Trivial factors can be detected by plugging the size of one condition in the place of the primitive in the formulas of the *N.cons* as in (7.18). When all the instances of the tested condition display the outcome, the numerator equals the denominator, and the parameter takes its highest value of 1.00, supporting the claim that the condition is necessary to the outcome. Conditions with a score of *N.cons* higher than 0.95 can be tested for skewness through a further parameter dubbed *Relevance of Necessity* (*RoN*: Schneider & Wagemann, 2012) and calculated as in (7.19) below:

$$RoN_{A \leftarrow Y} = \frac{\left| 1 - A \right|}{\left| 1 - A \cap Y \right|} \tag{7.19}$$

The parameter takes its lowest scores when the distribution of the condition by the outcome of reference proves trivial, that is, when the size of 1 − *A* is remarkably smaller than the size of 1 − *A* ∩ *Y*, indicating that the instances of the negative outcome arise independently of the absence of the condition. The standard recommendation is to consider dropping the factors with *N.cons* close to 1.00 and low *RoN* from the hypothesis. Thus, such "analysis of necessity" is a recommended step to be performed ahead of constructing the truth table (Schneider & Wagemann, 2012).
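As a rough illustration of the test, the sketch below computes both parameters on made-up crisp data. It assumes that the size of a set is the sum of the units' membership scores, that the intersection is the unit-wise minimum, and that *N.cons* is the size of *A* ∩ *Y* over the size of *Y*; the function names and the data are hypothetical.

```python
def n_cons(a, y):
    """Consistency of necessity: |A ∩ Y| / |Y| (assumed form of the parameter)."""
    return sum(min(ai, yi) for ai, yi in zip(a, y)) / sum(y)


def ron(a, y):
    """Relevance of Necessity as in (7.19): |1 - A| / |1 - A ∩ Y|.

    Undefined (division by zero) when every unit is fully in both A and Y.
    """
    numerator = sum(1 - ai for ai in a)
    denominator = sum(1 - min(ai, yi) for ai, yi in zip(a, y))
    return numerator / denominator


# Hypothetical data: the condition is almost constant across six units.
a = [1, 1, 1, 1, 1, 0]
y = [1, 1, 1, 0, 0, 0]

print(round(n_cons(a, y), 3))  # necessity looks perfect...
print(round(ron(a, y), 3))     # ...but the low RoN flags a skewed condition
```

With these data, every instance of the outcome displays the condition, yet the condition is also present in two of the three negative instances, which is what the low relevance score picks up.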

The original expected advantage was of pinpointing those constant conditions that double the number of primitives in the truth table while leaving almost half of them unobserved and lowering the consistency of every solution. However, the dismissal of a quasi-constant may prove unwise if the model requires it to prevent contradictory primitives (Rohlfing, 2020). The essentiality of the contribution can be easily ascertained by verifying whether a change in the consistencies of the primitives occurs after the seemingly trivial condition is dropped from the hypothesis (Damonte, 2021a). Nevertheless, the calculation of the parameters of fit on individual conditions remains a crucial source of information, as their values can support directional expectations or suggest reconsidering them.

#### **7.4 Soundness**

The actual link between sets, predicates, and the real world is decided by how truth values are assigned to literals—that is, by gauging.

The standard assumption in representation measurement theory maintains that real-world properties depend on some units' deep structure that we can know indirectly only as meaningful variations in related observable attributes. This theory assumes we can represent these attributes through *numerical images* and capture their variation through adequate scales. Scales warrant that for any manifestation *pi* of the property *P* in the unit *ui* there is a measure *qi* of the image *Q* such that the functional relationship between measures preserves some fundamental relationship in the variation of the attribute.

The seminal work of Stevens (1946) pinpointed four such fundamental relationships: sameness, rank, distance, and proportion, preserved by nominal, ordinal, interval, and ratio scales, respectively. Conventional textbooks have long taught that a hierarchy of scope exists among measurements with the ratio scale at the top as the most "robust" one, i.e., abstracted from actual entities and their contexts. Intended as a prudential rule for naive statisticians (e.g., Luce, 1959), the hierarchy has turned into a canon and, as such, has been disputed since its introduction. Indeed, any measurement entails a *loss function*, and the loss is admissible that allows retaining crucial information (e.g., Guttman, 1977). Thus, prominent comparatists contend that ratio scales prove robust for detecting fine-grained changes, but sacrifice the information on "critical points." The qualitative change that occurs in the state of a unit when the measure of a crucial attribute reaches a special value is better conveyed by nominal scales (e.g., Sartori, 1984, 1991; Collier & Mahon, 1993; Ragin, 2000; Goertz, 2020).

In short, scales entail a trade-off between *precision* and *meaning*. However, the trade-off can weaken when metric variables are remapped as *fuzzy sets*.

#### *7.4.1 Gauging for QCA: The Theoretical Side*

#### **7.4.1.1 The Starting Point**

Zadeh (1968, 1978) introduced fuzzy sets to widen the scope of algorithmic problem-solving. He noted how machines could deliver precise solutions, but limited to trivial problems, while the human brain tackles complex issues through linguistic structures with hazy *hedges* such as ⌜ *very*⌝ , ⌜ *somewhat*⌝ , or ⌜ *almost*⌝ .

Fuzzy scores translate hedges into weights (*μ*) ranging from 0.00 to 1.00 to convey the degrees of membership of *ui* to the set of *A* instances. They, too, understand the membership in a set and its opposite as complements, calculated as in (7.20):

$$
\mu_{i \in \bar{A}} = 1.00 - \mu_{i \in A} \tag{7.20}
$$

where ∈ reads ⌜ *in*⌝ .

The meaning of the relationship between complements is established by a third relevant value, the *crossover*. Conventionally weighing 0.50, the crossover is the point of neutrality and signals a membership neither in the set nor in its complement.

Logically, fuzzy scores capture the *possibility* that the statement ⌜*ui* is *A*⌝ is true for the actual unit *ui*: 1.00 indicates the statement is *certainly* true; 0.00 indicates the statement is *certainly not* true; 0.50 indicates that the positioning of *ui* is *highly ambiguous* given the observation. Therefore, original fuzzy scores defy a strictly bivalent logic. The advantage is that the three points allow alignment of linguistic hedges, sets, and metric variables through a triangular, trapezoidal, or bell-shaped function. This *filter function* maps the raw values *νA* (e.g., age in years) into fuzzy scores *μA* (e.g., membership in the set ‹young›) so that it conveys the certainty that a 16-year-old is in the set and a 36-year-old is almost so.

To map meanings onto fuzzy scores, then, the researcher needs to establish


#### **7.4.1.2 Ragin's Reinvention**

For QCA, Zadeh's original proposal is affected by a twofold ambiguity. First, linguistic hedges are seldom clearly ordered, and a straightforward correspondence with particular fuzzy scores can prove idiosyncratic. Second, triangular, trapezoidal, or bell-shaped relations can make each fuzzy score *μA* correspond to more than one raw score on *νA*, which makes it hard to retrieve the raw value from the fuzzy score.

Ragin's fuzzy sets avoid these issues with a gauge that, before rendering natural language, includes both pieces of information of interest to comparatists: those of "differences in degree," and of "differences in kind" (Ragin, 2000). His filter functions are monotonic non-decreasing, which re-establishes the isomorphism of raw values, fuzzy membership scores, and selected hedges, as in Table 7.3.

The remapping of raw variables into fuzzy scores is especially illuminating of Ragin's rationale of conversion. He portrays it as an operation of *calibration* defined as the fine-tuning of an instrument to improve the validity of its measurements. Although the concept best applies to continuous variables, the calibration rationale also informs the transformation of qualitative data into fuzzy scores (e.g., De Block & Vis, 2019). Indeed, the instrument to be fine-tuned is the filter function, whose shape can be decided using different methods (Ragin, 2000, 2007, 2008:96; Duşa, 2019).

The *indirect method of calibration* assigns the same "qualitative score" from a scale such as (*c*) or (*f*) in Table 7.3 to groups of cases with similar raw values. Then, the cases' raw scores may or may not be filtered into predicted fuzzy scores through the qualitative scores by fractional polynomial regression.


**Table 7.3** Possible positions of *ui* to *A*, and corresponding membership values *μ<sub>A</sub>*

*Source*: Ragin (2000:156, 2009)

The *direct method of calibration*, on the other hand, stipulates that the filter function is a growth curve of odds. The smoothness of the slopes is decided every time by suitable raw values for *αA*, *γA*, *βA*. These chosen raw scores are pegged to conventional fuzzy values, fixed at 0.953, 0.500, 0.047, respectively. The log-odds of *μα* are ln(0.953 / (1 − 0.953)) ≈ 3, while those of *μβ* are ln(0.047 / (1 − 0.047)) ≈ −3; thus, the fuzzy membership of the *i*-th unit with raw value *νi* is calculated as in (7.21) below:

$$\mu_{i} = \begin{cases} \dfrac{e^{3\frac{\nu_{i}-\gamma}{\alpha-\gamma}}}{1 + e^{3\frac{\nu_{i}-\gamma}{\alpha-\gamma}}}, & \nu_{i} > \gamma \\ 0.5, & \nu_{i} = \gamma \\ \dfrac{e^{-3\frac{\nu_{i}-\gamma}{\beta-\gamma}}}{1 + e^{-3\frac{\nu_{i}-\gamma}{\beta-\gamma}}}, & \nu_{i} < \gamma \end{cases} \tag{7.21}$$
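The direct method can be sketched as a short function. This is a minimal illustration, assuming that the log-odds of membership are the deviation of the raw value from the crossover, rescaled so that the inclusion threshold maps to +3 and the exclusion threshold to −3, and that inclusion > crossover > exclusion. The function name `calibrate` and the threshold values in the usage lines are hypothetical.

```python
import math


def calibrate(v, alpha, gamma, beta):
    """Direct calibration of a raw value v into a fuzzy membership score.

    alpha: raw value pegged to full inclusion (membership about 0.953)
    gamma: raw value pegged to the crossover (membership 0.5)
    beta:  raw value pegged to full exclusion (membership about 0.047)
    Assumes alpha > gamma > beta.
    """
    if v == gamma:
        return 0.5
    if v > gamma:
        log_odds = 3.0 * (v - gamma) / (alpha - gamma)
    else:
        log_odds = -3.0 * (v - gamma) / (beta - gamma)
    # Convert log-odds back into a membership score.
    return math.exp(log_odds) / (1.0 + math.exp(log_odds))


# Hypothetical thresholds for a set such as <rich>, in raw income units.
alpha, gamma, beta = 20000, 5000, 2500
for raw in (25000, 20000, 5000, 2500, 1000):
    print(raw, round(calibrate(raw, alpha, gamma, beta), 3))
```

The function is monotonic non-decreasing in the raw value, so each fuzzy score can be traced back to a single raw value, and raw values beyond the outer thresholds approach 1.00 or 0.00 without ever reaching them.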

Ragin's fuzzy sets can be conceived of as crisp sets weighted by a *classification error*. As such, they convey both qualitative and quantitative information, circumventing the trade-off between scales. Indeed, the crisp classification still holds with fuzzy scores, following the rule of conversion in (7.22):

$$A_i = \begin{cases} 1, & \mu_{i\in A} > 0.50 \\ 0, & \mu_{i\in A} < 0.50 \end{cases} \tag{7.22}$$

where *Ai* is the crisp membership of the *i*-th unit in the set *A*, while *μ<sub>i ∈ A</sub>* is the fuzzy membership of the same *i*-th unit in the same set.

The preservation of crisp sets' qualitative information by QCA's fuzzy scores is further ensured by the convention that the crossover shall not be assigned to any actual unit of analysis, or by the convention of dropping the 0.5-instances under the argument that they cannot bring helpful information to the analysis (Ragin, 2008; Duşa, 2019).

Furthermore, the basic rules for calculating intersection and union as in (7.6) and in (7.10) also apply to fuzzy sets. However, fuzzy scores cannot meet the axiom of strong identity (7.1); instead, they follow the more common version (7.23) below, meaning that sameness is preserved for units with the same score.

$$\mathbf{A}\_{i} \coloneqq \mathbf{A}\_{i} \tag{7.23}$$

The principles of non-contradiction and excluded middle again hold with fuzzy scores in a crisp understanding, as clarified by (7.24) and (7.25):

$$
\mu_{i \in \{A \cap \bar{A}\}} < 0.5 \tag{7.24}
$$

$$
\mu_{i \in \{A \cup \bar{A}\}} > 0.5 \tag{7.25}
$$

It is worth noting that the size of a fuzzy union calculated by (7.6) is usually smaller than its crisp version, while the size of a fuzzy intersection calculated by (7.10) is usually larger than its crisp version due to the *residuals* that fuzzy scores leave in the partition.
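A small numerical check may help to see (7.24), (7.25), and the role of residuals. The sketch assumes the usual fuzzy operators, with the intersection as the unit-wise minimum and the union as the unit-wise maximum, and no unit sitting exactly on the crossover; the scores and function names are made up for illustration.

```python
def complement(mu):
    """Membership in the opposite set, as in (7.20)."""
    return [1 - m for m in mu]


def intersection(a, b):
    """Fuzzy intersection as the unit-wise minimum (assumed operator)."""
    return [min(x, y) for x, y in zip(a, b)]


def union(a, b):
    """Fuzzy union as the unit-wise maximum (assumed operator)."""
    return [max(x, y) for x, y in zip(a, b)]


def crisp(mu):
    """Crisp conversion as in (7.22); assumes no score equals 0.5."""
    return [1 if m > 0.5 else 0 for m in mu]


a = [0.9, 0.7, 0.2, 0.6]  # hypothetical fuzzy scores of four units in A
not_a = complement(a)

contradiction = intersection(a, not_a)   # every score stays below 0.5
excluded_middle = union(a, not_a)        # every score stays above 0.5

# Sizes of the partitions: fuzzy scores leave residuals that crisp scores do not.
fuzzy_residual = sum(contradiction)
crisp_residual = sum(intersection(crisp(a), crisp(not_a)))
print(round(fuzzy_residual, 2), crisp_residual)
print(round(sum(excluded_middle), 2), sum(union(crisp(a), crisp(not_a))))
```

In the crisp rendering the intersection of a set and its complement is empty and their union includes every unit in full, whereas the fuzzy rendering leaves a positive residual in the former and falls short of the full count in the latter.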

#### **7.4.1.3 Fuzzy Sufficiency and Necessity**

With fuzzy scores, subset relationships are established as the *containment* (Ragin, 2000; cfr. Zadeh, 1978) of membership functions.

Therefore, fuzzy-set sufficiency is captured by Eq. (7.26):

$$
\mu_{i \in \omega_{.}} < \mu_{i \in Y} \tag{7.26}
$$

Equation (7.26) entails that, if we plot our units on a Cartesian plane defined by the membership scores in *ω*. as the x-axis and the membership scores in *Y* as the y-axis, if *ω*. is sufficient to *Y*, it distributes the units *above* the bisector in an *upper-triangular* shape.

Instead, fuzzy-set necessity corresponds to (7.27):

$$
\mu_{i \in \omega_{.}} > \mu_{i \in Y} \tag{7.27}
$$

Equation (7.27) means that the antecedent *ω*. that is necessary to *Y* distributes the units *below* the bisector in a *lower-triangular* shape.

By extension, the relationship of necessity and sufficiency arises when the units' membership scores in a primitive (or implicant, or condition) equal those in the outcome, distributing the units *along* the bisector in a *linear* shape.
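The three shapes can be told apart with a simple comparison of membership scores. The following sketch is only an illustration with made-up scores and a hypothetical function name; it treats units lying exactly on the bisector as compatible with either triangle, and real data seldom show perfect containment, which is why the parameters of fit are needed.

```python
def containment(x, y):
    """Describe how the units distribute relative to the bisector.

    x: membership scores in the primitive (x-axis)
    y: membership scores in the outcome (y-axis)
    """
    if all(xi == yi for xi, yi in zip(x, y)):
        return "necessary and sufficient"   # units along the bisector
    if all(xi <= yi for xi, yi in zip(x, y)):
        return "sufficient"                 # upper-triangular shape
    if all(xi >= yi for xi, yi in zip(x, y)):
        return "necessary"                  # lower-triangular shape
    return "neither"


outcome = [0.4, 0.7, 0.9]
print(containment([0.2, 0.6, 0.8], outcome))
print(containment([0.5, 0.8, 1.0], outcome))
print(containment([0.4, 0.7, 0.9], outcome))
print(containment([0.2, 0.8, 0.9], outcome))
```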

The *S.cons* parameter preserves its meaning with fuzzy scores, although they can blur the *recognition* of violations as the residuals *μ<sub>i ∈ Y ∩ Ȳ</sub>* inflate their values. The *Proportional Reduction of Inconsistency* (*PRI*: Ragin, 2008; Schneider & Wagemann, 2012) has been introduced to deflate and complement the information from the *S.cons* calculated with fuzzy scores. The parameter builds on the rationale of the proportional reduction of error commonly employed to determine whether the information about *A* improves our prediction of *Y* (e.g., Menard, 1995). It reads as in (7.28):

$$PRI_{\omega_{.} \to Y} = \frac{\left| \omega_{.} \cap Y \right| - \left| \omega_{.} \cap Y \cap \overline{Y} \right|}{\left| \omega_{.} \right| - \left| \omega_{.} \cap Y \cap \overline{Y} \right|} \tag{7.28}$$

where the vertical bars again indicate the size of the fuzzy partition as the sum of the units' fuzzy membership scores in the partition, such that, for instance, |*ω*.| = Σ<sub>i=1</sub><sup>N</sup> *μ<sub>i ∈ ω.</sub>*.

The set-theoretical task of the *PRI* is to establish whether the conditional relationship holds, net of fuzzy residuals. It takes the same value as the *S.cons* when the size of the residuals is null (|*Y* ∩ *Ȳ*| = 0.00). It degenerates when the units systematically display higher residuals than membership in the primitive: *μ<sub>i ∈ Y ∩ Ȳ</sub>* > *μ<sub>i ∈ ω.</sub>*. Last, it takes lower values than the *S.cons* when the units' residuals are non-null and lower than the membership in the primitive: 0 < *μ<sub>i ∈ Y ∩ Ȳ</sub>* < *μ<sub>i ∈ ω.</sub>*.

A *PRI* value considerably lower than the corresponding *S.cons* points to inconsistencies that may justify the exclusion of the primitive from minimizations, or the reconsideration of gauges, conditions, or the starting hypothesis.
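The comparison between the two parameters can be illustrated as follows. The sketch assumes that *S.cons* is the size of the intersection of the primitive and the outcome over the size of the primitive, that intersections are unit-wise minima, and that sizes are sums of scores; the data and function names are hypothetical.

```python
def s_cons(x, y):
    """Consistency of sufficiency: |ω ∩ Y| / |ω| (assumed form of the parameter)."""
    return sum(min(xi, yi) for xi, yi in zip(x, y)) / sum(x)


def pri(x, y):
    """Proportional Reduction of Inconsistency as in (7.28)."""
    overlap = sum(min(xi, yi) for xi, yi in zip(x, y))
    # Residual: membership shared by the primitive, the outcome, and its negation.
    residual = sum(min(xi, yi, 1 - yi) for xi, yi in zip(x, y))
    return (overlap - residual) / (sum(x) - residual)


# Hypothetical fuzzy scores of four units in a primitive (x) and the outcome (y).
x = [0.8, 0.6, 0.7, 0.3]
y = [0.9, 0.6, 0.4, 0.2]
print(round(s_cons(x, y), 3), round(pri(x, y), 3))

# With a crisp outcome the residual is null and the two parameters coincide.
x_crisp_case = [0.8, 0.6]
y_crisp = [1, 0]
print(round(s_cons(x_crisp_case, y_crisp), 3), round(pri(x_crisp_case, y_crisp), 3))
```

In the first example the gap between the two values signals that part of the apparent consistency comes from units that are simultaneously, if partially, in the outcome and in its negation.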

#### *7.4.2 Gauging for QCA: The Empirical Side*

Whether fine-grained membership scores properly render an *inus* factor only depends on how we construe our gauge, that is, on how we set the thresholds. Thresholds elicit a solution to the problem of aligning the extension and the intension of an attribute (Quine, 1982; Sartori, 1984; Goertz, 2020).

A theory-driven approach to the problem clarifies the intension first to prevent the risk of stretching attributes beyond their meaning, which would introduce more hidden heterogeneity than would be desirable for the analysis (see Chap. 10). At the same time, thresholds may spoil the analysis when they enforce some ideal yardstick that none of the units can meet. In short, theoretical thresholds can become useless when decisions are not fine-tuned to actual diversity.

QCA scholars have developed several recommendations to balance these opposite risks. The recommendations assist the researcher in tackling three intertwined problems: unit selection, the operationalization of causal properties, and the identification of thresholds that align meanings and empirics. In actual research, the point of attack may change; however, the resulting membership scores provide a single solution to all three issues, likely after some iteration.

#### **7.4.2.1 Establishing the Universe of Reference**

As in any technique, units of observation provide as solid an empirical ground to the analysis as the criteria for their selection. Such criteria should prevent or minimize the later rise of threats to credible results (e.g., Geddes, 1990; Goertz, 2020).

In explanatory QCA, case selection has to ensure enough diversity to capture the causal facts of interest. Thus, the criterion cannot exclusively focus on the dependent or the independent variable. Units selected on the outcome of interest would artificially prevent inconsistencies, thus making the validity of results undecidable. On the other hand, units selected on the factor of interest would turn it into a constant background feature and make its causal contribution undecidable. Hence, the first criterion that unit selection shall meet is the *variability* in realized states and combinations of factors.

The broadest variability follows from open universes, but open universes may endanger the preservation of meaning (e.g., Ragin, 2008). Geographical, historical, and cultural boundaries provide the closure of the units' heterogeneity required for making interpretable decisions about thresholds. Indeed, different *α*, *β*, *γ* may be needed to establish whether a country qualifies as ‹RICH›, ‹DEMOCRATIC›, or ‹EQUAL› in different world regions and time frames. Therefore, the second and related criterion for unit selection consists of finding the meaningful *scope condition* that encloses the universe of reference and ensures interpretable membership scores. In short, the correspondence of meaning and numbers comes at the cost of a restriction in the scope of the analysis, and in the generalizability of results (e.g., Goertz, 2017; Walker & Cohen, 1985; Verweij & Vis, 2021; Findley et al., 2021). The limitation, however, might not apply to the starting explanatory hypothesis, which may travel farther than its operational specifications.

#### **7.4.2.2 Operationalizing Intension**

The operation of connecting gauges and attributes meaningfully is seldom straightforward. Again, it is open to two opposite risks: providing too specific or too generic a definition of an attribute (e.g., Sartori, 1984; Ragin, 2008).

#### Hyper-Specificity

The fallacy of composition occurs when we recognize each "token" empirical manifestation as a different property and build a plethora of conditions with too narrow an extension (e.g., Menzies, 2004; Craver & Kaplan, 2020; cfr. Chap. 10). The problem can be solved by recognizing functional equivalences, climbing the ladder of abstraction, and gathering functionally equivalent manifestations under a single label.

Verba (1967) elaborates on the point by discussing how case-based evidence can be turned into a causal factor. From the historical report on how the eruption of Mount Vesuvius had a significant impact on the stability of the Pompeiian political system, we may identify either ‹eruption› or ‹calamity› as a relevant *inus* factor; however, the latter includes the former and accommodates a broader number of functionally alternative sources of disruptions, thus widening the scope of comparisons.

According to Verba, an even better operationalization shifts the attention from contextual conditions to the properties of the unit of analysis. Instead of gauging the sources of disruption, the operationalization can narrow on those resources and arrangements that make the system respond to disruption effectively. From this viewpoint, ‹resilient› better contributes to an explanatory theory of political systems' stability than ‹calamity›. The system attribute can apply to the Pompeiian case, but travel farther across contexts.

#### Hyper-Generality

The second and opposite problem arises when the properties are encompassing to the point of losing their analytic capacity.

The problem often arises when the available measure of a concept is a composite of predictors, enabling factors, proxies, outputs, and outcomes. Such assorted content can make these composites apply "*everywhere*, as any universal should" but also "to *everything*." As a result, we incur "theoretically, a 'nullification of the problem' and, empirically, what may be called 'empirical vaporization'" (Sartori, 1991; Chap. 9; cfr. Collier & Mahon, 1993).

QCA detects these composites as trivial conditions and suggests they can be dismissed. However, composites may contain relevant explanatory information. The *inus* standing of selected components can be decided by their consistency to the outcome and by minimizations. In addition or as an alternative, suitable rules of composition by disjunction and conjunction may be devised to compress sub-properties into "superconditions" (Elman, 2005; Berg Schlosser & De Meur, 2009; Goertz, 2017; Damonte & Negri, 2019).

#### The Problem of Missing Values

Often, available raw measures are plagued with missing values. QCA's algorithm cannot handle them unambiguously, as the units for which the value is missing would belong to two primitives. This ambiguity can be tackled by running parallel analyses to verify whether the different classifications result in different solutions. If not, the unit and its partial information would prove irrelevant. When different classifications affect solutions (for instance, because they decide whether a primitive is realized or not), the information proves relevant, but the problem arises of how to decide between the two solutions.

Missing raw values require some credible criterion of adjudication. Alternatively, the measure can be substituted with a complete gauge of the same intension, if any. Last, the unit can be dropped from the analysis (Ragin, 2008; Basurto & Speer, 2012; Duşa, 2019). The move may increase the number of logical remainders, but remainders can be more adequately addressed with counterfactual rules in minimization.

#### **7.4.2.3 Identifying Membership Thresholds**

Thresholds explicate the rule that establishes a unit to be an instance of the set given its raw value. The default recommendation is to anchor these decisions on external theories and conventions (Ragin, 2000, 2007, 2008).

Special values of national and international policy indicators—for instance, household income to establish the risk of poverty; the share of people in an age cohort in education or training to expect a certain quality of society; the share of debt to revenue to establish the credibility of a borrower—may offer accepted anchorages to calibration decisions. However, conventional knowledge may evolve at a slower pace than actual phenomena. Under particular contingencies or within special areas, its usage for calibration may return skewed membership scores that would not survive the *RoN* test. Besides, a conventional tipping point may coincide with some units in the population, making them uninformative.

To avoid these issues, conventional knowledge can be adjusted in light of distributional considerations (Ragin, 2008). Although descriptive statistics lack qualitative meaning, considerations about quintiles seem unavoidable in large-N studies or whenever previous knowledge is wanting (e.g., Ragin & Fiss, 2017). A supplementary strategy, consistent with the concern for non-contradictory partitions, prescribes cluster analysis to identify the raw values to be used as thresholds. The underlying rationale maintains that units close to each other belong to the same partition and, hence, that thresholds lie in the "natural gaps" between clusters.

Although long offered as a standard function for threshold setting by many software packages (e.g., Duşa, 2019), cluster analysis has raised concerns that its application might convey a deceiving sense of certitude about calibration and solutions. The risk of overconfidence can also increase when the membership scores are assigned directly following one of the scales in Table 7.3. Indeed, the researcher's classification error can always affect scoring operations in unknown directions.

To keep the risk at bay, zooming into the units around a threshold can help to support decisions with empirical knowledge when the number of cases allows it (Ragin, 2000; De Block & Vis, 2019). Frontier literature has also developed on false negatives and false positives in solutions (Braumoeller, 2015; Rohlfing, 2018) and on alternative filtering functions (Thiem, 2010). A further strategy suggests ascertaining the "robustness" of the solutions by running parallel analyses under different perturbations of units and thresholds (Marx & Duşa, 2011; Maggetti & Levi-Faur, 2013; Duşa, 2019; Oana & Schneider, 2018).

Many of these considerations are more justified in exploratory than in explanatory applications of QCA. When the driving concern is the preservation of particular meanings, different gauges can seldom render it equally well. To illustrate, Ostrom's theory of corruption maintains that people's perception of ineffective monitors and sanctions drives the belief of diffused wrongdoing that invites resorting to corruption along the lines of a self-fulfilling prophecy. In testing the tenability of this theory, the indexes of inefficiency in administration often used as a proxy of corruption are less suitable gauges of the phenomenon to be explained than the measures of perceived corruption.

In explanatory usages, however, coders' biases are possible, and this possibility can be explored by simulating some systematic tendencies toward strictness, generosity, confidence, or coyness in assigning membership scores. These tendencies can be rendered by calculating the *concentration* (7.29), *dilation* (7.30), *intensification* (7.31), or *moderation* (7.32) of the original fuzzy scores (Smithson & Verkuilen, 2006):

$$
\mu_{i \in A}^{Conc} = \mu_{i \in A}^{2.0} \tag{7.29}
$$

$$
\mu_{i \in A}^{Dil} = \mu_{i \in A}^{0.5} \tag{7.30}
$$

$$
\mu_{i \in A}^{Int} = \begin{cases}
\mu_{i \in A}^{0.5}, & \mu_{i \in A} > 0.5 \\
\mu_{i \in A}^{2.0}, & \mu_{i \in A} < 0.5
\end{cases} \tag{7.31}
$$

$$
\mu_{i \in A}^{Mod} = \begin{cases}
\mu_{i \in A}^{2.0}, & \mu_{i \in A} > 0.5 \\
\mu_{i \in A}^{0.5}, & \mu_{i \in A} < 0.5
\end{cases} \tag{7.32}
$$

These transformations expose the worsening or the improvement that coders' biases can impart to solutions. They prove that truth tables and solutions inevitably change with scoring strategies, and the intensification, by bringing the fuzzy truth table closer to its crisp version, inevitably enhances the consistency and symmetry of observed primitives. In the end, the relative fragility of findings mirrors the specificity of our operationalization, but also its local value. It counts less as a problem of the technique or the algorithm than an issue in our knowledge, models, and gauging strategies.
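The four transformations in (7.29) to (7.32) can be applied to a vector of scores with a few lines of code. The sketch below is a minimal illustration with made-up scores and hypothetical function names; since the equations leave the crossover case undefined, the code assumes that a score of exactly 0.5 is left unchanged.

```python
def concentrate(mu):
    """Strict coder, as in (7.29): every score is squared."""
    return mu ** 2.0


def dilate(mu):
    """Generous coder, as in (7.30): every score takes its square root."""
    return mu ** 0.5


def intensify(mu):
    """Confident coder, as in (7.31): scores are pushed away from 0.5."""
    if mu > 0.5:
        return mu ** 0.5
    if mu < 0.5:
        return mu ** 2.0
    return mu  # assumption: the crossover is left untouched


def moderate(mu):
    """Coy coder, as in (7.32): the exponents of (7.31) are swapped."""
    if mu > 0.5:
        return mu ** 2.0
    if mu < 0.5:
        return mu ** 0.5
    return mu  # assumption: the crossover is left untouched


scores = [0.1, 0.3, 0.6, 0.8, 0.95]  # hypothetical original fuzzy scores
for name, transform in [("concentration", concentrate), ("dilation", dilate),
                        ("intensification", intensify), ("moderation", moderate)]:
    print(name, [round(transform(m), 3) for m in scores])
```

Re-running the truth table and the minimization on each transformed vector gives a sense of how sensitive a solution is to systematic scoring tendencies. With these exponents, some moderated scores can land on the other side of the crossover (for example, 0.6 becomes 0.36), so the crisp classification of a unit may change as well.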

#### **7.5 Summing Up**

To run a credible explanatory QCA, a researcher may want to


the plausible solution do not improve the *S.cons* values on the parsimonious, consider re-running the analysis from step 5 without these additional conditions to verify the robustness of minimizations.


You can find the example here https://doi.org/10.5281/zenodo.7117973. Enjoy your explanatory QCA!

#### *Suggested Readings*


#### **Review Questions**

Section 7.2


#### Section 7.3


#### Section 7.4


#### **References**


Smithson, M., & Verkuilen, J. (2006). *Fuzzy set theory: Applications in the social sciences*. Sage.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 8 Causal Inference and Policy Evaluation from Case Studies Using Bayesian Process Tracing**

#### **Andrew Bennett**

**Abstract** Case studies enable policy-relevant causal inferences when experimental and quasi-experimental methods are not possible. Even when other methods are possible, case studies can strengthen inferences either as a standalone method or as part of a multimethod research design. The chapter outlines the case study method of process tracing (PT), which is a within-case mode of analysis that builds upon Bayesian logic to make inferences to the best explanation of the outcomes of single cases. The chapter locates the epistemological basis of PT in the development and testing of theories about the ways in which causal mechanisms operate to generate outcomes. It then defines PT and outlines best practices on how to do it, illustrating these with examples of case study research on the COVID pandemic. The chapter then outlines the comparative advantages of PT vis-à-vis other methods, and identifies the kinds of research questions and research contexts for which PT is most useful. This leads to a brief discussion of two methodological innovations: formal Bayesian PT and the use of causal models in the form of Directed Acyclic Graphs to assist PT and integrate qualitative and quantitative evidence. The chapter concludes with the strengths and limits of PT.

#### **Learning Objectives**

After reading this chapter, you should be able to:


A. Bennett (\*)

Georgetown University, Washington, DC, USA e-mail: BennettA@Georgetown.edu

<sup>©</sup> The Author(s) 2023

A. Damonte, F. Negri (eds.), *Causality in Policy Studies*, Texts in Quantitative Political Analysis, https://doi.org/10.1007/978-3-031-12982-7\_8

#### **8.1 Introduction**

Policymakers often need to assess the likely outcomes of alternative policies. To do so, they frequently need to develop causal understandings of past outcomes in situations where few cases exist and experiments are not possible for ethical or financial reasons. Process tracing (PT), a technique of within-case analysis analogous to detective work or medical diagnosis, is a key method of causal inference in individual cases. The goal is to explain the outcome of a single case, and as in detective work, the researcher can build upon both "suspects" (theories that provide potential alternative explanations for the outcome of a case) and "clues" (evidence or diagnostic tests).

Case studies have a long history: implicitly, they have been the primary method for historians and political observers since the Greek historian Thucydides wrote his chronicles in the fifth century BC. Many case studies have been done without much methodological rigor, however, which has given case study methods a bad reputation in some fields of research. In the past two decades, methodologists in political science and sociology have greatly improved and systematized case study methods, particularly the method of PT. This includes efforts to both refine case study methods and disseminate them to researchers through organizations, such as the American Political Science Association's section on Qualitative and Multimethod work, and training programs, including those sponsored by the Institute for Qualitative and Multimethod Research (IQMR) at Syracuse University, the European Consortium for Political Research (ECPR), the Global School on Empirical Research Methods (GSERM) at the University of St. Gallen, summer schools at the University of Oslo and the University of Essex, and MethodsNet.

The present chapter gives an overview of PT and recent innovations in this method. It begins with a discussion of the epistemic assumptions of PT, building on Daniel Little's Chap. 2 in this volume. It then defines PT and outlines best practices on how to do it, illustrating these with examples of case study research on the COVID pandemic. Next, the chapter assesses the comparative advantages of PT vis-à-vis other methods, including some of those addressed in the other chapters in this volume. This section also identifies the kinds of research questions and research contexts for which PT is most useful. The chapter then outlines two new developments in PT methods: formal Bayesian PT, and the use of causal models in the form of Directed Acyclic Graphs to assist in PT and to integrate qualitative and quantitative evidence. The chapter concludes with the strengths and limits of the method.

#### **8.2 The Epistemic Foundations of Process Tracing**

For policy purposes as well as academic theoretical progress, we need causal knowledge: what will be the outcome if we try policy X or if X happens in the world? Yet all research methods confront what has been called the "fundamental problem of causal inference": we cannot rerun history after trying policy X, or after X happens in the world, and observe the outcome in the absence of X, while holding all other variables and historical developments constant.

Although no method can fully surmount this problem, scholars have outlined four general approaches to causation and associated methodological approaches to causal inference: regularity, counterfactual analysis, manipulation/experiments, and the causal mechanism account (Brady, 2008; see Chap. 2). The regularity approach, which Henry Brady calls "neo-Humean" after the philosopher David Hume, focuses on what Hume called "constant conjunction," or what we now call correlation, as the key to scientific explanation (Brady, 2008). The well-known limitation of this approach is that correlation does not equal causation. Even when observational data is plentiful, and robust correlations convince us that some causal relationship probably exists, the nature of the process that generates the correlations may be unknown, and the direction of causation (does A cause B, or does B or the expectation of B cause A?) is not always certain. Statistical analyses also face the "ecological inference problem": even if a correlation is causal, it does not necessarily explain any individual case in the population under study. A medicine could be helpful on average, for example, and at the same time be lethal to those who have an allergy to that medicine.

The counterfactual approach, and associated "potential outcomes" methods, posit that something is a cause if it satisfies the following: "if A then B, if not A then not B" (or, if not A then B does not happen in the same way, at the same time, or with the same magnitude). This definition is intuitively appealing as a common-sense understanding of causation, but it is more a thought experiment than a method of inference because we cannot observe counterfactual outcomes. In addition, counterfactuals are unsatisfying as explanations, and a weaker guide to policy choices in other cases, if they lack some account of the process through which the observed outcome arose (and that through which the unobserved counterfactual could have arisen).

The manipulation or experimental approach works to get as close as possible to observing the counterfactual outcome. It does so by selecting a "control" case or unit (or many randomly selected control cases or units) on which no manipulation is performed, and comparing its outcome to that of a case or unit that is as similar as possible to the control unit except that it has been subject to some treatment (or, if there are many randomly selected cases, to that of a randomly selected treatment group).

This gets around some of the limitations of observational statistical analyses, but experiments have many demanding requirements or assumptions that must be met to be internally and externally valid. By one account, 26 requirements must be met for an experiment to allow a valid causal inference, including that random assignment has been properly done, that the proper statistical test is applied, that the sample size is sufficiently large, that there is no "compensatory rivalry" (which can happen if experimental subjects find out which group they have been assigned to and try harder to achieve a favorable outcome), and that there are no treatments that occur apart from the specific one under study (Cook, 2018). Even when these assumptions are met, an experiment may or may not get us much closer to understanding the processes that generate the observed outcome(s), which limits our ability to anticipate the scope conditions under which the causal relationship holds. In addition, for many important policy challenges, experiments are impractical, a point elaborated below. Even when field experiments are possible or historical processes provide "natural" experiments with nearly random assignment of individuals to some "treatment," experiments outside of a controlled laboratory setting introduce many potential confounding variables that make it difficult to satisfy the assumptions necessary for causal inference.

The fourth approach, focusing on causal mechanisms and their capacities, provides the epistemological basis for PT (see Chap. 2, herein). In one much-cited definition, causal mechanisms can be thought of as "entities and activities, organized such that they are productive of regular changes" (Machamer et al., 2000). Causal mechanisms are the ontological entities in the world that generate the outcomes we observe, and we attempt to model these mechanisms with theories. This approach is consistent with and, in some sense, more fundamental than the others outlined above, as it includes a focus on the activities or processes that create correlations, that make experiments work, and that explain both actual and, if we could observe them, counterfactual outcomes. It is the regularity of causal mechanisms, or what some have called "invariance," that gives them explanatory power.<sup>1</sup> Put another way, causal mechanisms cannot be "turned off" when the conditions that enable their operation exist.

Unlike some approaches to explanation, the causal mechanisms view rejects "as if" theoretical assumptions, or assertions that theories need not be consistent with more micro-level processes as long as these theories are predictively accurate "as if" their stated or implicit micro-mechanisms were true. In a causal mechanisms approach to explanation, theories must be consistent with the evidence at lower levels of analysis or smaller slices of space and time. We may, for pragmatic reasons, consider a simplified theory adequate for some policy purposes even if it does not give details on micro-level processes, but we do so knowing that a theory that is more consistent with the details at the next level down has greater accuracy and might lead to more nuanced policy prescriptions. The 1960s theory that "smoking can cause cancer," for example, was sufficient for the public health policy advice "don't smoke," even though the detailed processes relating smoking to cancer were unknown at the time. We now have a more detailed theory about smoking and cancer that allows more precise policy prescriptions, such as "people with a mutation at a specific region on chromosome 15 are at a particularly high risk of cancer if they smoke." Theories on macro-level social processes and outcomes can be useful, and for some purposes, it may be more efficient to do PT at the macro level, but if macro-level theories work through lower levels of analysis like individuals' choices, they must still be consistent with the processes through which those choices are made to be considered as accurate as possible.

<sup>1</sup> "Invariance," as used here, does not exclude probabilistic causal relations; it can include probabilistic relations that are in some way bounded (Waldner 2012, 2016).

PT exploits this aspect of mechanistic explanations by generating and assessing evidence, sometimes in detailed slices of space and time, on the explicit or implicit processes hypothesized by alternative explanations for the outcomes of individual cases. It thus takes advantage of two sources of evidence and inference that Hume did not include as core features of his constant conjunction account: *contiguity* and *sequencing*. Contiguity gets at entities in spatial proximity, bumping into each other or exchanging information; in social phenomena, this means who said or did what to whom. Sequencing uses the order in which things happened to help make inferences to the best explanation of the outcomes of cases. Although it can be empirically hard to tell which of two parties escalated a confrontation, for example, the order in which it happened matters in explaining the outcome.

The focus on evidence on hypothesized processes raises three challenges for PT: how far down must we go into the details of processes? when should we stop gathering evidence? and how far back in time should we go to provide adequate explanations? Unfortunately, while Bayesian logic, outlined below, provides answers to these questions, they are rather general: we stop pushing into more detailed observations, gathering additional evidence, or probing earlier points in time when we think it is unlikely that doing so will change our confidence in the likelihood of alternative explanations sufficiently to be worth the effort it would entail. Put another way, process tracers balance two risks:


On a more pragmatic level, at some point, social scientists leave the study of more detailed social and psychological processes to other fields of study that have the skills and equipment to gather and assess evidence on these processes: cognitive psychology, neuroscience, microbiology, and so on. But we should (and do) pay at least some attention to the research at these lower levels of analysis because findings inconsistent with our theories indicate that we need to modify those theories. In the fields of economics and political science, for example, numerous theories build on research<sup>2</sup> that demonstrates how human decision-making often involves cognitive biases that depart from the assumptions of earlier rational choice models.

<sup>2</sup>Studies of the biological basis of emotions, and the effect of emotions on decision-making, are at an earlier stage of development, but are starting to gain notice in the social sciences as well.

#### **8.3 Process Tracing Best Practices and Examples from COVID Research**

#### *8.3.1 Definition of Process Tracing*

PT is the gathering and "analysis of evidence on processes, sequences, and conjunctures of events within a case for the purposes of either developing or testing hypotheses about causal mechanisms that might causally explain the case" (Bennett & Checkel, 2015:7).

Bayesian logic is the underlying foundation of PT. Bayesianism in PT treats probabilities as degrees of belief in alternative explanations.<sup>3</sup> In this approach, we use our existing background knowledge to form initial degrees of belief in alternative explanations of the outcome of a case (called the "priors"), and then analyze evidence to form updated degrees of belief, now conditioned on the evidence (called the "posteriors"). The relative probability of evidence under the explanations is called the "likelihood" (or, when comparing two explanations, the "likelihood ratio"). Bayesianism uses the laws of probability to convert the likelihood of the evidence conditioned on the explanations to the posteriors, or the likelihood of the explanations conditioned on the evidence.

In mathematical symbols, Bayes Theorem outlining this process of updating can be expressed as in Eq. (8.1):

$$Pr\left(P \mid k\right) = \frac{Pr(P)Pr\left(k \mid P\right)}{Pr(P)Pr\left(k \mid P\right) + Pr\left(\sim P\right)Pr\left(k \mid \sim P\right)}\tag{8.1}$$

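where Pr(*P* | *k*) is the posterior (updated) probability that explanation *P* is true given evidence *k*; Pr(*P*) is the prior probability that *P* is true; Pr(*k* | *P*) is the likelihood of evidence *k* if *P* is true; Pr(~*P*) is the prior probability that *P* is false; and Pr(*k* | ~*P*) is the likelihood of evidence *k* if *P* is false.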


<sup>3</sup> In frequentist statistics, by contrast, probability represents the limit of an event's relative frequency in many trials.

A mathematically equivalent equation, known as the "odds" form of Bayes Theorem, which in some ways is easier to work with, is as follows:

$$\text{Posterior Odds Ratio} = \text{Likelihood Ratio} \times \text{Prior Odds Ratio}$$

where the Likelihood Ratio is the probability of finding evidence *k* conditional on *P* being true divided by the likelihood of *k* conditional on *P* being false. In the notation of probability, the equivalent equation reads as in (8.2):

$$\frac{\Pr\left(P \mid k\right)}{\Pr\left(\sim P \mid k\right)} = \frac{\Pr\left(k \mid P\right)}{\Pr\left(k \mid \sim P\right)} \bullet \frac{\Pr\left(P\right)}{\Pr\left(\sim P\right)}\tag{8.2}$$
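As a quick numerical check that Eqs. (8.1) and (8.2) describe the same update, the following minimal Python sketch computes the posterior both ways. The function names and the input values are illustrative assumptions, not figures from any case study.

```python
def posterior_probability(prior, lik_if_true, lik_if_false):
    """Eq. (8.1): Pr(P | k) from Pr(P), Pr(k | P), and Pr(k | ~P)."""
    numerator = prior * lik_if_true
    return numerator / (numerator + (1.0 - prior) * lik_if_false)


def posterior_odds(prior, lik_if_true, lik_if_false):
    """Eq. (8.2): posterior odds = likelihood ratio * prior odds."""
    likelihood_ratio = lik_if_true / lik_if_false
    prior_odds = prior / (1.0 - prior)
    return likelihood_ratio * prior_odds


# Hypothetical inputs chosen only for illustration.
prior = 0.25         # Pr(P)
lik_if_true = 0.8    # Pr(k | P)
lik_if_false = 0.2   # Pr(k | ~P)

post = posterior_probability(prior, lik_if_true, lik_if_false)
odds = posterior_odds(prior, lik_if_true, lik_if_false)

# Converting odds back to a probability recovers the Eq. (8.1) result.
post_from_odds = odds / (1.0 + odds)
print(f"posterior via (8.1): {post:.4f}")
print(f"posterior via (8.2): {post_from_odds:.4f}")
```

With these inputs the likelihood ratio is 4, so evidence *k* moves the odds on *P* from 1:3 to 4:3; both routes give the same posterior.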

An intuitive way to understand Bayesian logic is to think of the strength of evidence, or the relative likelihood of finding a particular piece of evidence under alternative explanations. Evidence that is much more likely under one explanation than under another has high probative value. We already have a colloquial language for the strength of evidence (Van Evera, 1997: 31–32): evidence can constitute "smoking gun" tests, "hoop" tests, "doubly decisive" tests, or "straw in the wind" tests.
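One common Bayesian reading of these four tests is in terms of the two likelihoods Pr(*k* | *P*) and Pr(*k* | ~*P*). The sketch below is only an illustration of that reading: the likelihood pairs assigned to each test, the even prior, and the `update` helper are assumptions chosen to make the contrast visible, not values drawn from Van Evera or from any case.

```python
def update(prior, lik_if_true, lik_if_false):
    """Posterior Pr(P | evidence) by Bayes' theorem, as in Eq. (8.1)."""
    num = prior * lik_if_true
    return num / (num + (1.0 - prior) * lik_if_false)


# Hypothetical (Pr(k | P), Pr(k | ~P)) pairs for each kind of test.
TESTS = {
    "hoop": (0.95, 0.60),               # k nearly certain if P is true
    "smoking gun": (0.40, 0.05),        # k very unlikely if P is false
    "doubly decisive": (0.95, 0.05),    # both of the above
    "straw in the wind": (0.60, 0.40),  # neither
}

prior = 0.5
results = {}
for name, (p_k_true, p_k_false) in TESTS.items():
    # Posterior if the evidence is found, and if it is sought but not found.
    found = update(prior, p_k_true, p_k_false)
    not_found = update(prior, 1.0 - p_k_true, 1.0 - p_k_false)
    results[name] = (found, not_found)
    print(f"{name:18s} found: {found:.2f}   not found: {not_found:.2f}")
```

Under these assumed numbers, failing a hoop test lowers confidence in *P* sharply while passing it raises confidence only a little; a smoking gun test shows the reverse pattern; a doubly decisive test moves beliefs strongly in either direction; and a straw in the wind test moves them only slightly.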


#### *8.3.2 How to Do Process Tracing*

A brief outline of how to do PT is as follows:


<sup>4</sup> In contrast to statistical methods, random selection of cases is inadvisable in small-n research, and it is best to select cases for study with at least preliminary knowledge of the values of their independent and dependent variables. Cases that are positive on an independent variable of interest and positive on the outcome of interest (positive-positive cases) present potential opportunities to examine whether and how/through what processes or mechanisms the independent variable generates the outcome. Positive-negative cases, in which a hypothesized variable does not lead to a positive outcome, can clarify the scope conditions of that variable. Negative-positive cases show paths to the outcome that do not involve the independent variable whose value is negative. Negative-negative cases provide less useful information. One should not study nuclear weapons proliferation, for example, by looking at countries that have neither a nuclear power program nor a close ally that might share nuclear technology and that (unsurprisingly) do not have nuclear weapons.

update their degrees of confidence in alternative possible explanations for the outcome.

• Finally, weigh the totality of the evidence, including both strong and weak evidence, and update the prior estimate of each explanation's likelihood of being true to produce a new posterior estimate.

Thus far, this account outlines the deductive side of PT. In addition, PT has an inductive side. Any unanticipated evidence that appears to play a causal role but does not fit any of the candidate explanations might provide the basis for a new explanation of the case. When a researcher adds a new alternative explanation, it is necessary to re-estimate the priors of the revised set of explanations, re-estimate the likelihood of evidence under each explanation relative to the others, and reweigh the totality of the evidence to update the likelihood that each of the alternative explanations is true.
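The bookkeeping involved in this inductive step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the explanations `H1`, `H2`, and `H3`, all the numbers, and the `update` helper are hypothetical, and the pieces of evidence are treated as conditionally independent given each explanation.

```python
def normalize(weights):
    """Rescale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def update(priors, likelihoods):
    """Bayes update over mutually exclusive explanations.

    priors:      {explanation: Pr(H)}
    likelihoods: {explanation: Pr(evidence | H)}
    """
    return normalize({h: priors[h] * likelihoods[h] for h in priors})


# Deductive pass with two candidate explanations (hypothetical numbers).
posterior = normalize({"H1": 0.5, "H2": 0.5})
for lik in [
    {"H1": 0.7, "H2": 0.3},  # k1 favors H1
    {"H1": 0.2, "H2": 0.2},  # k2 is surprising under both explanations
]:
    posterior = update(posterior, lik)

# Inductive step: the puzzling k2 suggests a new explanation H3.
# Re-estimate the priors for the revised set, then re-assess ALL the
# evidence, old and new, under every explanation including H3.
revised_posterior = normalize({"H1": 0.4, "H2": 0.4, "H3": 0.2})
for lik in [
    {"H1": 0.7, "H2": 0.3, "H3": 0.5},
    {"H1": 0.2, "H2": 0.2, "H3": 0.9},
]:
    revised_posterior = update(revised_posterior, lik)

print({h: round(p, 3) for h, p in posterior.items()})
print({h: round(p, 3) for h, p in revised_posterior.items()})
```

In this toy run, the second piece of evidence does nothing to discriminate between the two original explanations, but once the new explanation is added and all the evidence is reweighed, the new explanation ends up with the highest posterior.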

Bayesian logic in PT helps dispel a common misconception about the validity of different kinds of iterations between theories and evidence. Methodologists often argue that a researcher cannot develop a theory from a case and then test it against that same case. There is a good rationale for this injunction in frequentist statistical methods: a theory derived from correlations found in a population sample cannot legitimately be tested against that same population sample, as the probability of disproving the new theory is zero. Using Bayesian logic in PT, however, makes it possible to derive a theory from a piece of evidence and then test that theory in the same case (Fairfield & Charman, 2018). There are two reasons for this, one incontrovertible and one more contestable. The incontrovertible reason is that it is often possible to develop a theory from a case and then to test it against different, independent, and heretofore unexamined evidence from the case that could still prove the new theory to be wrong. Detectives and doctors do this all the time: a doctor might find one piece of diagnostic evidence that suggests a patient might be afflicted by a disease the doctor had not previously considered, and this insight can lead to additional diagnostic tests on the same patient. If the new tests are based on biological relationships that are independent of the first test, they can either affirm or disconfirm the new candidate diagnosis. It would be nonsensical to argue that the new diagnosis should be tested on a different patient to find out why the first patient is ill.

The second rationale for developing and testing a theory in the same case is more ambitious and contestable: it argues that it is legitimate to derive a theory from a piece of evidence in a case and to claim that this *same evidence* can be a severe test of the theory. In Bayesianism, it does not matter whether one first identifies an explanation and then assesses the likelihood of evidence under that explanation relative to rival explanations, or first derives a theory from evidence and then assesses the relative likelihood of that evidence vis-à-vis the new explanation and its rivals. Evidence that is consistent with one explanation and inconsistent with its rivals is strong evidence in favor of the explanation, no matter when or how the explanation was derived (Fairfield and Charman, 2022). To use an analogy, if a detective thought an aggrieved business associate was the most likely suspect in a robbery, but then found a video recording of the crime scene showing a neighbor whom she had not previously suspected carrying out the crime, the very evidence that turned attention to the new suspect would also be powerful evidence for a conviction. The counterpoint to the unqualified application of this view is that humans are subject to potential confirmation bias, and it may be harder to objectively assess the likelihood of less definitive evidence under alternative explanations once the evidence is known to be true. Either way, Bayesian logic dictates that when we develop a new explanation or theory, we have to go back and re-evaluate all the evidence we gathered earlier, assessing its likelihood under the new theory in comparison to its likelihood under the theoretical explanations we had already considered.

#### *8.3.3 Best Practices in Process Tracing*

The section below on new and future developments outlines more recent, formal Bayesian ways of carrying out PT. The present section turns to pragmatic advice about best practices in both informal and formal Bayesian PT. These practices are summarized in Table 8.1 (from Bennett & Checkel, 2015:21), and briefly elaborated below.

#### **8.3.3.1 Cast the Net Widely for Alternative Explanations**

**Table 8.1** Best practices in PT

Source: Bennett and Checkel (2015)

It is important to consider a wide range of alternative explanations. Considering a few additional explanations that may quickly prove to be weak and deserving only of a footnote risks spending additional time and effort, but leaving out a viable explanation skews the analysis of the likelihood of the evidence and jeopardizes inferences from a case study. How do we know whether we have considered a sufficiently wide range of alternative explanations? I present here several "checklists" of common sources of potential social explanations as a pragmatic guide.

First, we can look to "off-the-shelf" theories academics have applied to similar questions, participants' and stakeholders' explanations for events and outcomes, historians' and area and functional experts' explanations, and the implicit or explicit explanations offered by news reporters (Bennett & Checkel, 2015: 23).

Second, the literature on quasi-experiments and program evaluation identifies many general explanations to consider. These include the following:<sup>5</sup>


A third checklist of explanations to consider includes four kinds of agent–structure relations: (1) agents affecting structures; (2) structures enabling or constraining agents; (3) agent to agent interactions; and (4) structure to structure relationships (like demographic change). These four kinds of agent–structure relations intersect with three broad families of social and political theories focused on (1) ideas/identities/social relations; (2) material resources and incentives; and (3) institutional transaction costs/functional efficiency. The resulting matrix encompasses 12 common kinds of theories. For example, the functional efficiency family of theories includes agents emulating other agents whom they view as successful, structures selecting out efficient agents as in evolutionary selection, functional competition among agents creating market or balance of power structures, and structure to structure processes like adverse selection (see Bennett, 2013; Bennett & Mishkin, 2023, for elaboration).

<sup>5</sup> Many of these are discussed in Shadish et al. (2002); this same list is included, nearly *verbatim*, in Bennett, forthcoming.

It is important to note that the requirement for mutual exclusivity among candidate explanations is often misunderstood (Bennett et al., 2021, cf. Zaks, 2020). Mutual exclusivity can always be set up by explanations that point to different independent variables as the primary or most important variable in determining the outcome; only one variable can be the main one. It can also take the form of explanations that draw on different variables, but this does not have to be the case. Mutual exclusivity does not require that explanations be monocausal, and it does not prohibit explanations that draw on some or even all of the same variables. Explanations can involve as many variables as a researcher wants, in any functional forms or relationships the researcher wants to specify. They can also use exactly the same variables but just pose different possible functional relations among them. For example, an internal combustion engine needs four things to function: fuel, oxygen, a spark, and compression. These same four things could produce failure to function in different combinations or functional relationships. It may be that an engine does not turn over because the spark plug and piston rings are both a bit worn, the fuel is low octane or has some contaminants, and the air intake is a bit clogged, in such a way that improving any one of these would be enough to get the engine to turn over. Or maybe two of these components are fine and two are just faulty enough that together they prevent the engine from turning over.
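The engine example can be made concrete with a toy model in which two rival explanations use exactly the same four variables but combine them differently. Everything in this Python sketch is a hypothetical illustration: the `Engine` class, the condition scores, and the thresholds are assumptions chosen so that both explanations account for the observed failure while disagreeing about what would fix it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Engine:
    # Hypothetical condition scores from 0.0 (failed) to 1.0 (perfect).
    fuel: float
    air: float
    spark: float
    compression: float


def runs_if_cumulative(e: Engine) -> bool:
    """Explanation A: small deficits add up across all four components."""
    return e.fuel + e.air + e.spark + e.compression >= 3.3


def runs_if_joint_fault(e: Engine) -> bool:
    """Explanation B: only the combination of poor spark and poor
    compression stops the engine; fuel and air are good enough."""
    return not (e.spark < 0.8 and e.compression < 0.8)


observed = Engine(fuel=0.85, air=0.85, spark=0.75, compression=0.75)

# Both explanations agree on the observed outcome: the engine fails.
print(runs_if_cumulative(observed), runs_if_joint_fault(observed))

# They disagree about an intervention on the fuel alone.
better_fuel = replace(observed, fuel=1.0)
print(runs_if_cumulative(better_fuel), runs_if_joint_fault(better_fuel))
```

The two explanations are mutually exclusive even though they share all their variables, and they have different observable implications: under Explanation A better fuel alone gets the engine running, while under Explanation B it does not.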

In addition, the aspiration or claim to have achieved an exhaustive set of alternative explanations is always provisional. We can never be sure that the candidate explanations are exhaustive because it is always possible that the true explanation is one we have not considered or discovered. We cannot include an explanation we have not conceived. This is one reason that Bayesians are never 100% confident that they have identified the correct explanation for an outcome.

#### **8.3.3.2 Be Equally Tough on the Alternative Explanations**

It is tempting to pick a "favorite" explanation early in a research project, but it is important to resist this temptation, as it can lead to confirmation bias. The alternative explanations should be plausible; if they are not plausible, they need to be reformulated or other explanations need to be considered. One of the ways that rigorous methods work is that they help us, or even force us, to guard against our own confirmation biases.

In PT, this takes the form of thinking through the observable implications for *all* of the hypotheses. This includes asking for each explanation "what would be the observable implications about the process and sequence in the case if this explanation is true"—a question that comes naturally due to the way our brains work. It also includes asking "what would be true if this explanation is false"—a question we might overlook if PT methods did not require us to address it.

It is also important to do PT in relatively equal depth on each of the alternative hypotheses. Otherwise, there is an inclination to favor one hypothesis or another and to keep looking for confirming evidence for that explanation until you find it, and to stop looking for PT evidence on the alternative explanations after finding one or a few pieces of evidence that make them less likely.

#### **8.3.3.3 Consider the Potential Biases of Evidentiary Sources**

Documentary records can be biased by the preferences or instrumental goals of the people who made them regarding what they want to record, keep, and make available. Interviewees can have instrumental goals or motivated biases as well. They can also have unmotivated biases: recalled memories can be inaccurate, and the interviewee may have had access to some information streams and not others at the time of the events being studied. One way to take such potential biases into account is to discount the weight of evidence that could be subject to these biases.

#### **8.3.3.4 Consider Whether the Case Is Most or Least Likely for Alternative Explanations**

This recommended practice relates to the estimation of the case-specific priors on the alternative explanations.

When an explanation has a high prior (a most-likely case), but there is strong evidence in the case that the explanation is not correct, this might not only affect our explanation of the case at hand—it might lead us to narrow the scope conditions of the failed explanation and lower its prior for similar cases. Conversely, if the evidence from a case strongly supports an explanation that had a low prior, this might lead us to widen the scope conditions of this explanation and increase its prior for similar cases.

It is also useful at times to pick cases in which some of the explanations usually offered for the kind of case being studied simply cannot apply because their key variables or enabling scope conditions were not present. This can simplify the PT on such cases as it reduces the number of explanations on which PT is necessary.

#### **8.3.3.5 Make a Justifiable Decision on When to Start**

As discussed above in the section on epistemology, there is no general rule for selecting the temporal starting point for a case study. Often, it is useful to start at a critical juncture at which a key choice was made among alternative policies or at which a strong exogenous shock occurred. But the choice of a temporal starting point also depends on whether we want to study deep, structural, and often slow-moving causes or shorter-term, proximate causes that often relate more to agency than to structures.

Either way, the researcher must balance the costs and risks of going too far back in time, which increases the time and effort required for the PT, versus those of not going sufficiently far into the past, which risks overlooking important earlier causes that set in motion later mediating causes that explain less of the variation in outcomes across cases.

#### **8.3.3.6 Be Relentless in Getting Diverse Evidence, but Make a Justifiable Decision on When to Stop**

Here again there is no precise general rule: the researcher must balance the costs and risks of stopping the collection of evidence too soon, when a little more evidence could have greatly changed our confidence in the explanations, versus those of stopping too late, which leads to wasted time and effort and little additional updating on the alternative explanations.

Bayesian logic adds a little more specificity to this broad advice, as it indicates that after you have examined a lot of the same kind of evidence, each additional piece of that kind of evidence has a low probability of surprising you or pushing you to update your beliefs on the likelihoods that alternative explanations are true. This is because similar evidence has already been taken into account or used for updating. However, different kinds of evidence that have not been so exhaustively examined are more likely to lead to significant updating on the alternative explanations.
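One way to see why similar evidence adds little is to extend the odds form of Bayes Theorem in Eq. (8.2) to two pieces of evidence. The symbols $k_1$ and $k_2$ are introduced here only for illustration; the decomposition itself follows from the chain rule of probability:

$$\frac{\Pr\left(P \mid k_1, k_2\right)}{\Pr\left(\sim P \mid k_1, k_2\right)} = \frac{\Pr\left(k_2 \mid P, k_1\right)}{\Pr\left(k_2 \mid \sim P, k_1\right)} \cdot \frac{\Pr\left(k_1 \mid P\right)}{\Pr\left(k_1 \mid \sim P\right)} \cdot \frac{\Pr\left(P\right)}{\Pr\left(\sim P\right)}$$

The first factor is the likelihood ratio of the second piece of evidence *given that the first is already known*. If $k_2$ is nearly redundant with $k_1$ (for instance, a second account that draws on the same underlying source), then $k_2$ is almost equally expected under $P$ and under $\sim P$ once $k_1$ has been observed, so this factor is close to 1 and the posterior odds barely move. A different kind of evidence, whose likelihood is not already largely fixed by $k_1$, can have a conditional likelihood ratio far from 1 and so produce substantial updating.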

#### **8.3.3.7 Combine PT with Case Comparisons if Relevant**

While PT is a within-case method, it can be fruitfully combined with comparative case studies to strengthen causal inferences and clarify the scope conditions of explanations. A particularly powerful combination is the use of PT on "most-similar" and "most-different" cases.

Most-similar cases are the same (or at least roughly the same)<sup>6</sup> in the values of all but one of the independent variables and they have different values on the dependent variable. This provides some evidence that the difference on the one independent variable may cause the difference on the dependent variable, but this inference is provisional, since there may be other potentially causal factors that differ between the two cases and that are not included among the independent variables. It is thus useful to apply PT both to assess whether there is a pathway through which the value on the independent variable that differs leads to the outcomes of the two cases and to assess whether the other potentially causal factors that differ do not lead to or cause the outcomes.

<sup>6</sup> Fully similar comparisons (comparisons between cases with roughly similar values on all the independent variables and on the dependent variable) are analogous to the "coarsened exact matching" that some quantitative methods use. See Chap. 4 herein.

Conversely, a most-different (or least-similar) case comparison involves two cases with the same value on the dependent variable and only one independent variable that has the same value. Here, PT can assess whether the common independent variable leads to the outcomes and whether other shared potentially causal factors do not.

#### **8.3.3.8 Be Open to Inductive Insights**

PT is most efficient when the researcher first develops a set of candidate explanations as described in (1) above and identifies their observable implications and the associated evidence to gather. The deductive effort this requires is quick and inexpensive compared to the field, interview, or archival work of actually gathering the evidence. At the same time, it is important to remain alert for evidence that suggests possible causal processes not included in the initial set of explanations.

The feeling of puzzlement or surprise at an unexpected or unanticipated piece of evidence can lead to the development of a new explanation of a case for which the researcher can identify new observable implications on which to seek evidence. For this reason, it is often useful to do some initial open-ended research on a case—a process that some have called "soaking and poking"—as researchers immerse themselves in a case.

This is not the same as trying to approach a case without preconceptions, as some suggest in the grounded-theory or other traditions:<sup>7</sup> soaking and poking is still preceded by developing a set of theories, and unexpected evidence emerges against the background of those theories. In other words, we recognize it as puzzling because it does not fit any of our candidate explanations well. In practice, there can be many iterations between the explanations and the evidence (Fairfield & Charman, 2018).

#### **8.3.3.9 Use Deduction to Infer What Must Be True if a Hypothesis Is True**

While deductively deriving the observable implications of a theory is fast and easy compared to gathering evidence, it is still challenging and contestable. Theories are usually not sufficiently detailed to immediately identify their observable implications in a particular case. This means that researchers and their readers or critics will not always agree on what the observable implications are for an explanation.

<sup>7</sup>While scholars in the grounded theory approach recognize that approaching a case without preconceptions is impossible, as our minds are pre-ordered by all kinds of theories and experiences, they nonetheless urge trying to do so as much as possible. The standard advice in the process tracing approach is to instead develop and be explicit about candidate explanations, drawing on the sources identified above, and use them to decide which evidence to look for.

The best that a researcher can do here is to be clear and explicit about the implications they derived from a theoretical explanation and the logic through which they derived them. It is also possible to entertain alternative readings of the implications of a theory, and to factor into the conclusions whether some or all of these proved true. If the evidence was consistent with both of two possible interpretations of a theory, for example, then the theory is likely to be true regardless of which interpretation one uses.

To identify observable implications, it is necessary to mentally inhabit the hypothetical world in which the explanation is true and imagine very concretely the specific steps, sequences, and processes through which the explanation's independent variable(s) could have generated the outcome.<sup>8</sup> Often, researchers are not sufficiently concrete and specific in thinking about who should have said or done what to whom when if an explanation were true. There can also be functionally equivalent substitutable steps at different points in the hypothesized process. If possession of a gun was necessary for a suspect to have committed a crime, for example, evidence that the suspect had purchased a gun is equally informative no matter whether the gun was paid for by check or credit card.

#### **8.3.3.10 Remember Not All PT Is Conclusive**

A final injunction is to remember that not all PT is conclusive. Whether it is highly conclusive depends on whether the evidence is much more likely under one explanation than under the others, and this cannot be known beforehand. In addition, even when the evidence does greatly raise the likelihood that one explanation is true, there is always some possibility that an even more accurate explanation never occurred to the researcher.

For these reasons, process tracers can never be 100% certain, and it is important to be clear about any uncertainty that remains after analyzing the evidence. In the formal Bayesian PT approach described below, this takes the form of specifying the posterior on each hypothesis in terms of an explicit probability or range of probabilities.

#### *8.3.4 Examples from COVID Case Studies*

While laboratory studies on the COVID-19 coronavirus have led to a rapid accumulation of knowledge about its biochemistry, case studies using a PT logic have been vitally important in learning about its transmission in real-world settings, where experiments are not possible. When COVID-19 first emerged as a public health concern, doctors, scientists, and government officials had limited knowledge of how the disease spread. It is easy in this instance to construct mutually exclusive and exhaustive means of transmission: (1) airborne inside only; (2) airborne inside and outside; (3) airborne inside plus transmission via common contact surfaces; or (4) airborne inside and outside plus infection through contact surfaces.<sup>9</sup> Epidemiologists had a range of views on what prior likelihood they should assign to each hypothesis, but in the end the priors did not matter much because powerful evidence emerged that was much more likely under explanation 1 than under explanations 2–4, pointing to indoor airborne spread as by far the most common means of transmission.

<sup>8</sup>Fairfield and Charman (2017) suggest this practice of mentally inhabiting the world of a hypothesis to help assess the likelihood of evidence under that hypothesis; it is also useful in deciding what evidence to look for in the first place.

A key early case study came from a restaurant in Guangzhou, China, where one patron who had COVID dined on January 24, 2020 with three family members. Two other families dined at adjacent tables. Within 5 days, nine members of the three families developed COVID, with no other known exposures apart from the restaurant and subsequent within-family transmission. Close study of the restaurant seating revealed that, outside of the index patient's family, only those in the airflow path of the air conditioner that blew air across the table of the index patient developed COVID, while none of the other 83 restaurant patrons or eight staff developed COVID. The authors of a study on this case concluded that droplet transmission in the air-conditioner airflow was likely the key transmission mechanism, and recommended improved ventilation and greater table distancing in restaurants. The absence of any cases among the restaurant staff who handled the index patient's dirty dishes can be considered a failed smoking-gun test: it slightly reduces the likelihood of transmission of coronavirus through contact with surfaces of objects (Lu et al., 2020).

A later case study of a superspreader event at a choir practice in March 2020 underscored the danger of air transmission inside. Of the 61 people who attended the 2.5-hour practice, including one symptomatic index patient, 32 confirmed and 20 probable secondary COVID-19 cases occurred. The study concluded that close proximity and the act of singing led to high rates of transmission (Hamner et al., 2020).

The most definitive case study of COVID transmission, however, came from an event that provided a strong natural experiment (Shen et al., 2020). In January 2020, 128 people took two separate buses with recirculating cooling units (60 people in the first bus and 68 in the second, including a symptomatic index patient in the second bus) on a 100-minute round trip ride to a 150-minute event. Another 172 individuals attended the event but did not travel on either bus. None of the attendees wore masks. At the event, participants attended a morning service outdoors, followed by a brief lunch inside. They then returned to the same bus that had brought them, and took the same seats. Within days, 23 people on the second bus developed COVID, none of the passengers of the first bus developed COVID, and another seven individuals who were in close contact with the index patient at the ceremony or lunch but who had not ridden by bus developed COVID. Passengers seven rows behind the index patient on the bus developed COVID, while passengers next to windows that could be opened had lower rates of infection. This case provided further smoking gun evidence of air transmission in long exposure indoors, including transmission by small and relatively far-traveling aerosol droplets as well as heavier droplets. Later studies concluded that, while transmission through surface contacts could not be ruled out and cases of such transmission have been reported when individuals touched an object that had been sneezed or coughed upon by a COVID patient, the odds of catching COVID were approximately one case for every 10,000 surface contacts (CDC, 2021). Similarly, while the bus study did not discuss outdoor transmission and such transmission could not be ruled out due to the seven individuals who developed COVID without riding a bus, the rarity of confirmed cases of outdoor transmission has reportedly led many experts to conclude that such cases constitute only 1% of total cases and perhaps as low as 0.1% (Leonhardt, 2021).

<sup>9</sup>While some lung diseases, like Legionnaires' disease, can grow in bodies of water and then most commonly infect people through inhalation of contaminated aerosols, and other diseases like Ebola are transmitted by direct contact with bodily fluids, early cases of COVID and its similarity to other coronaviruses strongly suggested transmission by air and possibly also by contact surfaces.

A fourth case study indicates the high efficacy of mask-wearing to prevent COVID transmission. This study focuses on two hair stylists in Missouri who contracted COVID in 2020. While these individuals were symptomatic, they were in proximity to 139 patrons indoors. All wore masks, and none of the patrons developed COVID (Hendrix et al., 2020).

Although these four studies use the logic of PT implicitly rather than explicitly, their conclusions follow Bayesian logic. The authors intuitively used the likelihood of evidence under alternative explanations, together with the laws of probability, to update views of the likelihood of alternative COVID transmission paths in light of the evidence.
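The way strong evidence can swamp differing priors can be sketched as follows. All numbers are hypothetical and chosen only for illustration: the likelihoods stand in for how probable a finding such as the bus study might be under each of the four transmission hypotheses, and the two prior vectors stand in for two epidemiologists with different starting views. The sketch also assumes that successive findings are conditionally independent given each hypothesis.

```python
def bayes_update(priors, likelihoods):
    """Apply Bayes' rule over mutually exclusive, exhaustive hypotheses."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


hypotheses = [
    "1: airborne inside only",
    "2: airborne inside and outside",
    "3: airborne inside plus surfaces",
    "4: airborne inside and outside plus surfaces",
]

# Hypothetical P(evidence | hypothesis) for one strong finding.
likelihoods = [0.60, 0.05, 0.05, 0.01]

# Two hypothetical analysts with different priors.
flat_prior = [0.25, 0.25, 0.25, 0.25]
skeptical_prior = [0.10, 0.30, 0.30, 0.30]

flat_post = flat_prior
skeptical_post = skeptical_prior
for _ in range(2):  # two independent findings of the same strength
    flat_post = bayes_update(flat_post, likelihoods)
    skeptical_post = bayes_update(skeptical_post, likelihoods)

for name, a, b in zip(hypotheses, flat_post, skeptical_post):
    print(f"{name}: flat prior -> {a:.3f}, skeptical prior -> {b:.3f}")
```

With these made-up inputs, both analysts end up placing most of their posterior probability on the first hypothesis even though they started far apart.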

The chapter turns in the penultimate section to new methodological developments and the question of whether using the Bayesian logic of PT more formally and explicitly improves inference to the best explanation.

#### **8.4 The "Replication Crisis" and the Comparative Advantages of Process Tracing Case Studies**

#### *8.4.1 The Replication Crisis*

In the last 15 years, concerns over a "replication crisis" have swept through the social and medical sciences and the policy analysis and program evaluation communities. The crisis centers on the concern over high rates of failure in attempts to replicate peer-reviewed research findings in medicine and the social sciences, including those based on experiments as well as observational statistical studies. This does not necessarily mean that studies whose findings cannot be replicated are wrong: there are many reasons it may not be possible to replicate a study or its findings, including changes in the historical context that make it impossible to recreate the same sample as that in the original study. Yet there is also evidence that such sample differences do not account for much of the variation in results found in replication failures (Klein et al., 2018). In addition, there are well-known methodological problems that can lead to false or overly confident conclusions that could account for the high rate of replication failures of published research. These problems include publication bias (papers supporting their hypotheses are published at a higher rate than those that do not and a higher rate than studies with null findings), "*p*-hacking" (manipulation of experimental and analysis methods, possibly unwitting, that artificially produces statistically significant results [see Chap. 4 herein, especially Sect. 4.2.3, on the model dependence of statistical analyses]),<sup>10</sup> "*p*-fishing" (seeking statistically significant results beyond the original hypothesis), and "HARKing" (Hypothesizing After the Results are Known, or *post-hoc* reframing of experimental intentions to fit known data).
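A small calculation can make the danger of *p*-hacking and *p*-fishing concrete. The sketch below assumes, as a simplification, that a researcher runs several statistically independent tests of hypotheses that are all in fact null, each at the same significance threshold; the function name is hypothetical. Under those assumptions, the chance that at least one test comes out "significant" is one minus the chance that none does.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of one or more 'significant' results when every null is true.

    Assumes the n_tests tests are independent and each uses threshold alpha.
    """
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    prob = prob_at_least_one_false_positive(alpha=0.05, n_tests=n_tests)
    print(f"{n_tests:>2} tests at the 5% level: {prob:.1%} chance of a false positive")
```

Real specification searches rarely involve independent tests, so this is an illustration of the direction of the problem and not an estimate of its size in any particular study.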

One result of the replication crisis has been renewed emphasis on lab experiments, field experiments, natural experiments, regression discontinuity designs, and other research designs that attempt to allow causal identification. Even though experiments are among the methods that have experienced replication problems, and even though they have very demanding requirements and assumptions (especially field experiments: Cook, 2018), properly done experiments are less subject to some of the methodological limits of observational statistical studies. "Natural experiments," or real world situations in which samples of a population are assigned to or end up in two different contexts or "treatment" conditions in a way that is random or close to random, can also be powerful. Another approach that has generated increased attention is regression discontinuity designs, in which the investigator compares samples of a population just above and just below a threshold that is a cutoff at which a treatment, such as class size in public schools, is assigned (see Chap. 3 herein).

These experimental and quasi-experimental methods all have important roles to play in policy-relevant causal inferences. Researchers and journal editors have also taken steps to address the problems associated with the replication crisis. Preregistration of research designs, for example, limits the risk that researchers might unintentionally make so many modifications to their models that one model will produce a high degree of fit just by chance. Public repositories for data and replication materials are making research more transparent. Researchers have become more transparent about the assumptions behind instrumental variable and regression discontinuity designs and the conditions under which these achieve internal, statistical, and external validity (see Chap. 3 herein, especially Sect. 3.3). Some journals are carrying out replications before publication. Matching techniques (see Chap. 4 herein) and out-of-sample testing have become more common, and some journals have de-emphasized *p*-values in favor of a broader range of measures of the robustness of quantitative results, or moved to *p*-values of 1% rather than 5% as the standard for publication.

<sup>10</sup>The *p*-value, or probability value, tells you how likely it is that your data could have occurred under the null hypothesis. In other words, it tells you the probability of obtaining a test statistic as extreme or more extreme than the one calculated by your statistical test under the assumption that the null hypothesis is correct. It gets smaller as the test statistic calculated from your data gets further away from the range of test statistics predicted by the null hypothesis. A *p* level of 5% has by convention been considered in many journals to be the threshold for publishing results: this means, however, that there is still a 5% chance to see a test statistic at least as extreme as the one you found if the null hypothesis were correct.

Still, even with improved practices, experimental and quasi-experimental methods have limits that are different from those of PT. For many problems of interest to both scholars and policymakers (wars, epidemics, economic crashes, etc.), these methods can be subject to practical and ethical constraints and problems of internal or external validity. Lab experiments are quite different from real world conditions. Field experiments on large-scale phenomena that involve potential harm are unethical, and other kinds of field experiments may be prohibitively costly or operationally impossible. Natural experiments require a level of "as-if random" assignment to "treatment" and "control" groups that is rarely fully met except in studies of lottery winnings (Dunning, 2015). Regression discontinuity designs, as well as field and natural experiments, have the challenge of assessing potential confounding variables. In addition, all population-level analyses face the ecological inference problem.

Because case studies using PT have a different set of comparative advantages from those of experimental and quasi-experimental research designs, they are useful as both a standalone method and as a complement to these other methods in multimethod designs. Most obviously, PT is useful when policymakers are interested in understanding causation in individual cases. PT can be especially useful in studying deviant cases, or cases that do not fit existing theories, and inductively deriving and then assessing new potential explanations. But PT case studies are not just for situations in which we want to explain outcomes in one or a few cases, or when only a small number of cases exist. Even when there is a large and relatively homogenous population available for statistical or experimental study, case studies can help get closer to causal mechanisms, examining how they work down to small slices of space and time.

#### *8.4.2 Process Tracing on Complex Phenomena*

In addition, PT is useful for assessing various kinds of complexity. These include the following:

• *Endogeneity*. Endogeneity arises when there are feedback loops between the dependent and independent variables and when the direction of causation (*X* → *Y* versus *Y* → *X*) is unclear. In this regard, PT helps untangle the direction of causation by focusing on the sequence of events. This helps with the assessment of which events or pieces of information came first, and what events actors may have anticipated when they took action.


#### *8.4.3 Process Tracing in Multimethod Research*

PT can also be combined with other methods. One useful approach is to carry out a statistical analysis on observational data and then process trace one or a few cases to see if the hypothesized mechanisms that might explain population level correlations are evident in individual cases (Lieberman, 2005; Small, 2011). Statistical analysis can help identify outlier or deviant cases, and PT on these cases may help identify omitted variables (Bennett & Braumoeller, 2022). In natural experiments, PT on the ways in which different individuals or groups are "assigned" to or end up in the "treatment" and "control" groups can help assess the validity of the assumptions of "as-if random assignment," unbiased dropout rates, and no unmeasured confounders (Dunning, 2015). PT can be combined with Qualitative Comparative Analysis as well, helping to identify the potentially causal processes that generate the outcomes of individual cases (Schneider & Rohlfing, 2013).

#### *8.4.4 Process Tracing and Generalizing from Case Studies*

One alleged limitation of PT case studies is their supposed inability to generalize from their results, or to achieve external validity. This issue has often been misunderstood, however (George & Bennett, 2005; Bennett, 2022). "Average treatment effects" are not the only way to conceptualize generalization, and they are not always the most useful one. The "average treatment effect" of being born, for example, is having 1.5 X chromosomes and 0.5 Y chromosomes, an outcome that does not exist for any single person. Sometimes it is useful instead to have narrow but strong "contingent generalizations," or generalizations that apply to only a few cases or to a specified subset of a population, such as cases that share similar values on the independent variables and the dependent variable.

Single and comparative case studies using PT may or may not allow contingent generalizations. It is impossible for a researcher to know whether and to what population or scope conditions the findings of a case study will generalize before they have developed, perhaps partly inductively, a satisfactory explanation of the case. The understanding of the causal process that emerges from PT in a case study, together with theoretical intuitions on the scope conditions in which it operates and background knowledge on the frequency with which those conditions arise, is what determines whether, where, and how a case study's findings might generalize. Charles Darwin, for example, studied several bird species on remote islands and came away with the theory of evolution, whose scope conditions include all living things. Conversely, imagine discovering that a voter favored a candidate not because of party affiliation, ideology, or any of the usual reasons, but because the candidate was the voter's sister-in-law. This would only generalize to the relatives of candidates, or perhaps more loosely to social relations not ordinarily considered to be important to voting decisions (and some voters might vote against their in-laws despite sharing their party affiliations and policy views!).

In addition, the understanding of causal mechanisms that emerges from PT on a case, to the extent that this understanding is accurate, may generalize not only to similar cases or populations but to populations and contexts different from those of the case study at hand. As noted above, Darwin's theory of evolution applied not only to birds but to all living creatures. This is different from testing or applying a theory to an out-of-sample subset of a population, as is sometimes done in statistical analyses; it is applying a theory to an out-of-population case or sample.

#### *8.4.5 Limitations of Process Tracing*

The limitations of PT correspond with the strengths of experimental and quasi-experimental methods and of studies using statistical analyses of observational data. PT does not produce estimates of average effects, or correlation coefficients of independent variables. PT can shed light on how or through what mechanisms independent variables generated outcomes, but its inferences are more provisional and do not necessarily produce as confident an answer as randomized controlled experiments on whether a variable has an effect on the outcome.

#### **8.5 New Developments in Process Tracing**

Two new methodological developments are pushing the frontiers of process tracing. Both developments are outlined in forthcoming books, and both are rather technical and complex, so this chapter provides only a brief overview of each.

#### *8.5.1 Formal Bayesian Process Tracing*

Tasha Fairfield and Andrew Charman have worked out several methodological challenges to develop procedures for formal Bayesian PT (Fairfield & Charman, 2017; Fairfield & Charman, 2022). In formal Bayesian PT, researchers develop explicit numerical priors, between 0 and 1 or 0% and 100%, on the likelihood that alternative explanations are true (these could be ranges between high and low bounds, rather than point estimates). They also identify explicit numerical likelihood ratios for evidence conditioned on the alternative theories (which, again, need not be point estimates), and use these, together with Bayesian analysis of the collected evidence, to arrive at numerically explicit posterior estimates on the likelihood that alternative theories are true. Estimates of priors can be based on background information, on crowd-sourcing, or on a principle of indifference that assigns equal prior probability to all explanations. Estimates of likelihood ratios of evidence come from the theoretical logic of the alternative explanations. Researchers can check on the robustness of the posterior estimates by trying different distributions or ranges of priors and likelihood ratios.
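The robustness check described above can be sketched in a few lines. This is a minimal illustration, not the authors' procedure: it assumes only two rival, exhaustive explanations, uses the odds form of Bayes' rule, and takes hypothetical low and high bounds for the prior and for the likelihood ratio. The function names are invented for this example.

```python
def posterior_prob(prior, likelihood_ratio):
    """Posterior probability of H1 against a single rival H2.

    prior: P(H1) before seeing the evidence, strictly between 0 and 1.
    likelihood_ratio: P(evidence | H1) / P(evidence | H2).
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


def posterior_range(prior_bounds, lr_bounds):
    """Lowest and highest posterior implied by ranges of priors and ratios.

    The posterior rises with both the prior and the likelihood ratio,
    so the extremes occur at the corners of the two ranges.
    """
    low = posterior_prob(min(prior_bounds), min(lr_bounds))
    high = posterior_prob(max(prior_bounds), max(lr_bounds))
    return low, high


# Hypothetical ranges: prior between 0.3 and 0.5, ratio between 4 and 10.
low, high = posterior_range(prior_bounds=(0.3, 0.5), lr_bounds=(4, 10))
print(f"posterior for H1 lies between {low:.2f} and {high:.2f}")
```

If the conclusion a researcher wants to draw holds across the whole range, disagreement about the exact prior or likelihood ratio matters less.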

One useful innovation that Fairfield and Charman introduce is the use of a logarithmic scale for the likelihood ratios of evidence. This simplifies the math, as logarithms allow adding the weight of different pieces of evidence rather than using multiplication. In addition, logarithmic scales, such as the decibel (*db*) scale, reflect the ways in which humans experience stimuli such as light or sound. It is intuitively easy to ask if a piece of evidence is "whispering" (30 *db*), "talking" (60 *db*), "shouting" (70–80 *db*), or "screaming" or above (90+ *db*) in favor of one explanation or another. After assigning logarithmic weights to how much each piece of evidence argues in favor of one explanation vis-à-vis another, the researcher can simply add up all of the weights to arrive at posterior estimates, just as if adding weights on a scale.
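The arithmetic behind the logarithmic scale can be sketched as follows. The sketch assumes the conventional decibel definition, ten times the base-10 logarithm of a ratio, applied to likelihood ratios and to prior odds; whether a given application uses exactly this convention is an assumption here, and the evidence weights are hypothetical.

```python
import math


def weight_db(likelihood_ratio):
    """Weight of evidence in decibels: 10 * log10 of the likelihood ratio."""
    return 10 * math.log10(likelihood_ratio)


def posterior_from_weights(prior, weights_db):
    """Add evidence weights (in dB) to the prior log-odds, then convert back."""
    prior_db = 10 * math.log10(prior / (1 - prior))
    total_db = prior_db + sum(weights_db)
    odds = 10 ** (total_db / 10)
    return odds / (1 + odds)


# Hypothetical likelihood ratios for three pieces of evidence (H1 vs. H2).
ratios = [8.0, 2.5, 0.5]          # the third piece favors H2
weights = [weight_db(r) for r in ratios]

added = posterior_from_weights(prior=0.5, weights_db=weights)

# Same result by multiplying the ratios directly.
odds = math.prod(ratios)          # prior odds are 1 for a prior of 0.5
multiplied = odds / (1 + odds)

print(f"weights in dB: {[round(w, 1) for w in weights]}")
print(f"posterior by adding: {added:.4f}, by multiplying: {multiplied:.4f}")
```

Evidence favoring the rival explanation simply enters with a negative weight, so the final tally is a signed sum.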

A common misunderstanding here is that the number of necessary comparisons of theories vis-à-vis the evidence becomes combinatorially large as the number of explanations grows (Bennett et al., 2021; cf. Zaks, 2021). This assumes that the likelihood for each piece of evidence under every hypothesis must be compared directly to that of every other hypothesis. In fact, it is necessary only to compare the likelihood of each piece of evidence for one explanation to that of each of the other explanations, and this implicitly compares the likelihood of the evidence under all the explanations to each other. By way of analogy, one could weigh a watermelon in terms of strawberries, and then weigh all the other fruits in a store in terms of strawberries, and this would provide the relative weight of every fruit in terms of either watermelons or strawberries.
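The strawberry analogy can be written out in a few lines. In this sketch, which uses hypothetical hypothesis labels and made-up ratios, each explanation is compared only to one reference explanation; every other pairwise likelihood ratio then follows by division, so no further judgments are needed.

```python
from itertools import permutations


def pairwise_from_reference(ratio_vs_ref):
    """Derive every pairwise likelihood ratio from ratios against one reference.

    ratio_vs_ref maps each hypothesis H to P(evidence | H) / P(evidence | H_ref);
    the reference hypothesis itself therefore maps to 1.0.
    """
    return {
        (a, b): ratio_vs_ref[a] / ratio_vs_ref[b]
        for a, b in permutations(ratio_vs_ref, 2)
    }


# Hypothetical judgments: H1 is the reference ("strawberries").
ratio_vs_ref = {"H1": 1.0, "H2": 4.0, "H3": 0.5}

pairs = pairwise_from_reference(ratio_vs_ref)
for (a, b), ratio in sorted(pairs.items()):
    print(f"{a} vs {b}: {ratio:g}")
```

Two judgments against the reference are enough here to fill in all six ordered comparisons among three explanations.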

Formal Bayesian PT has the advantage of making explicit all the judgements that are made implicitly in informal PT. This clarifies where and why an author and their readers or critics might disagree: they could disagree on the priors, on the likelihood of evidence, or on the reading of the evidence itself (one person may think a person interviewed in a research project is untruthful, for example, and another may not). Despite the advantages of formal Bayesian PT, however, its advocates do not recommend doing it fully on every piece of evidence for every hypothesis. Doing so requires an unrealistically long and tedious write-up of research results. Researchers may find it useful, however, to carry out full formal Bayesian analysis on a small number of pieces of evidence that they consider to be the most powerful in discriminating among the hypotheses. In addition, even though it is inadvisable to fully carry out and write up formal Bayesian PT, the demonstration that it is in principle possible to do so, and the explication of the logic of doing so, help guide the reasoning of informal or partially formal Bayesian PT.

#### *8.5.2 New Modes of Multimethod Research*

A second innovation, in an article and a forthcoming book by Macartan Humphreys and Alan Jacobs, also builds on Bayesian logic and moves in a compatible but different direction. Humphreys and Jacobs use formal causal models, in the form of Directed Acyclic Graphs (DAGs), to help identify the hypothesized probabilistic dependencies among variables that enter into PT (Humphreys & Jacobs, 2015; Humphreys & Jacobs, 2023; on DAGs, see also Chap. 6 herein). These authors argue, as the present chapter has, that design-based inferential approaches like experimental and quasi-experimental methods cannot be carried out on many questions that interest both policymakers and scholars, and that these methods can sometimes provide information on effect sizes without clarifying the underlying models or mechanisms. Consequently, Humphreys and Jacobs focus on model-based inference rather than design-based inference.

DAGs are models that formally represent theories in ways that make these theories' assumptions about mediating, moderating, and potential confounding variables clear and precise. Put another way, DAGs are graphical representations of Bayesian networks. Mediators are variables along the hypothesized causal path between an independent and dependent variable, so they help explain how the independent variable affects the dependent variable. Moderators are variables that affect the relationship between an independent variable and the dependent variable—they can strengthen, weaken, or negate that relationship. Confounders are variables that affect both the value of an independent variable and that of the dependent variable in a causal model, making it hard to estimate the true effect of the independent variable.
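The definitions of mediators and confounders can be made concrete with a small sketch that stores a DAG as a dictionary of directed edges. The graph, its node names, and the helper functions are all hypothetical and are not taken from Humphreys and Jacobs. Moderators are left out because, in this simple representation, whether one variable changes the strength of another's effect is not visible from the arrows alone.

```python
def descendants(graph, start, blocked=frozenset()):
    """Nodes reachable from start along directed edges, avoiding blocked nodes."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for child in graph.get(node, ()):
            if child not in seen and child not in blocked:
                seen.add(child)
                stack.append(child)
    return seen


def mediators(graph, x, y):
    """Variables lying on some directed path from x to y."""
    return {m for m in descendants(graph, x)
            if m != y and y in descendants(graph, m)}


def confounders(graph, x, y):
    """Variables that cause x and also reach y by a path that avoids x."""
    nodes = set(graph) | {c for children in graph.values() for c in children}
    return {c for c in nodes - {x, y}
            if x in descendants(graph, c)
            and y in descendants(graph, c, blocked={x})}


# Hypothetical model: X affects Y through M; Z affects both X and Y;
# W affects X only, so it is a cause of X but not a confounder.
dag = {
    "Z": ["X", "Y"],
    "W": ["X"],
    "X": ["M"],
    "M": ["Y"],
}

print("mediators of X -> Y:", mediators(dag, "X", "Y"))
print("confounders of X -> Y:", confounders(dag, "X", "Y"))
```

Writing the model down this way forces the assumptions into the open: adding or removing a single arrow changes which variables count as confounders.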

Humphreys and Jacobs argue that the core logic of their approach is most closely connected to PT and Bayesian inference, and they maintain that formally representing theories as DAGs helps guide methodological choices in both PT and quantitative analysis in ways that modify some traditional advice about how to carry out PT. Contrary to some earlier advice on case selection, for example, they argue that model-based inference demonstrates that for many inferential purposes "on the regression line" cases, or cases in which the outcome of interest occurred, are not necessarily the most informative. Optimal case selection, in their view, depends on the population distribution of different kinds of cases and the probative value of the available evidence. They also argue that the focus on intervening causal chains (mediators) in PT can sometimes be less productive than examining moderating conditions (moderators). Finally, DAGs can inform choices in multimethod work between breadth (how many cases to study) and depth (how intensively to study individual cases).

More generally, Humphreys and Jacobs argue that their approach dissolves the usual distinctions between qualitative and quantitative research, and that it can address and integrate case level and population level queries.

#### **8.6 Conclusions**

PT methods have many uses and comparative advantages. Unlike experimental and quasi-experimental and statistical methods, they can develop inferences on alternative explanations of individual cases. As PT is always on observational evidence in single cases, its scope is not as limited by cost, ethical concerns, or availability as experiments or quasi-experiments (although, to the extent that PT involves human subjects research such as interviews, it can raise ethical issues that require approval from an institutional review board). PT brings causal inference close to the operation of causal mechanisms, sometimes in relatively small slices of space and time. While it is the only method (other than ethnographic methods) that is possible when only one or a few cases exist, it is still useful for illuminating the operation of causal mechanisms and assessing the assumptions behind other methods even when large or randomly assigned populations are available for study. It can therefore contribute to multimethod projects involving statistical, experimental, and quasi-experimental methods.

At the same time, PT has several limitations and poses a number of research challenges. Collecting the necessary evidence can be laborious and time-consuming, and the conclusions can only be as strong as the evidence allows. Identifying the observable implications of alternative explanations requires careful thought, and scholars might not agree on what rather general theories imply about such implications in particular cases. PT case studies may allow strong contingent generalizations, or they may not. More broadly, just as the strengths of PT arise in areas where quantitative methods are weak, PT is weak where these other methods are strong. PT does not produce estimates of average effects, or correlation coefficients of independent variables. It can shed light on *how* or through what mechanisms independent variables generated outcomes, but its inferences do not necessarily produce as confident an answer as randomized controlled experiments on *whether* a variable actually had any effect on the outcome. Yet precisely because the strengths and weaknesses of PT and quantitative methods offset each other, there is great value in combining these approaches in multimethod research.

Recent innovations by Fairfield, Charman, Humphreys, and Jacobs hold great promise for continuing the recent and rapid improvement of PT methods and practices.

These authors' ambitious innovations are at the cutting edge of PT techniques. As such, they have thus far been of interest mostly to methodologists and have not yet had a chance to be taken up by the much larger community of case study researchers. In short, although PT methods and practices are in some senses thousands of years old, they will continue to develop.

#### **Review Questions**


#### **References**


Humphreys, M., & Jacobs, A. (2023). *Integrated inferences*. Cambridge University Press.


#### *Suggested Reading*


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 9 Exploring Interventions on Social Outcomes with In Silico, Agent-Based Experiments**

#### **Flaminio Squazzoni and Federico Bianchi**

**Abstract** Agent-Based Modeling (ABM) is a computational method used to examine social outcomes emerging from interaction between heterogeneous agents by computer simulation. It can be used to understand the effect of initial conditions on complex outcomes by exploring fine-grained (multiple-scale, spatial/temporal) observations on the aggregate consequences of agent interaction. By performing *in silico* experimental tests on policy interventions where *ex ante* predictions of outcomes are difficult, it can also reduce costs, explore assumptions and boundary conditions, as well as overcome ethical constraints associated with the use of randomized controlled trials in behavioral policy. Here, we introduce the essential elements of ABM and present two simple examples where we assess the hypothetical impact of certain policy interventions while considering different possible reactions of individuals involved in the context. Although highly abstract, these examples suggest that ABM can be either a complement or an alternative to behavioral policy methods, especially when understanding social processes and exploring direct and indirect effects of interventions are important. Prospects and critical problems of these *in silico* policy experiments are then discussed.

#### **Learning Objectives**

By studying this chapter, you will:


F. Squazzoni (\*) · F. Bianchi

Department of Social and Political Sciences, University of Milan, Via Conservatorio 7, 20122 Milan, Italy e-mail: flaminio.squazzoni@unimi.it; federico.bianchi1@unimi.it

#### **9.1 Introduction**

Behavioral science methodology, including randomized controlled trials (RCTs), is increasingly being used in public policy as a gold standard to estimate causal relationships between interventions and outcomes (e.g., Shafir, 2012; Straßheim & Beck, 2019). Examples of behavioral policies, from public health to education, have shown the malleability of individual preferences and decisions, as well as the sensitivity of targeted individuals to cognitive frames in responding to policy interventions (Galizzi & Wiesen, 2018). The profoundly non-linear relationships between policy stimuli and people's observable and measurable responses, which challenge the 'big stimuli vs. big outcomes' mantra of conventional policy (Squazzoni, 2014), have suggested that, if well-conjectured and 'incentive compatible', even minimal interventions could cause large-scale outcomes (Dolan & Galizzi, 2014).

The reason why RCTs are considered the "gold standard" in behavioral policy is that the random assignment of a representative, targeted population to control and treatment groups that differ only in their manipulated conditions, together with the identification of any controllable, salient confounding factors by *ex ante* design, is instrumental to estimating causal effects. However, besides fundamental criticism of the often neglected influence of implicit assumptions on unobservable processes in research design (e.g., Imai et al., 2008), the use of experimental methods for public policy also has important pragmatic limitations.

To begin with, even when feasible, RCTs for public policy purposes could have a negative benefit-cost ratio. Ethical obstacles can prevent group selection or the exploration of conditions that would introduce inequality and negative externalities for certain groups. Moreover, economic costs are often severe even for small-scale pilots. Furthermore, the intrusive, 'outside-in' nature of experimental policies can affect real-life outcomes and people's behavior in other domains beyond any intended purpose. This is indeed a fundamental problem: not only do people often react unpredictably and adaptively to interventions (note that this has been a key argument for supporters of behavioral policies against the traditional policy framework based on positive/negative incentives and 'rational' response), but individuals are also embedded in social contexts, so that their exposure to policy treatments can trigger positive and negative network externalities or knowledge spillovers, which might also affect outcome measurements (Dolan & Galizzi, 2015; Squazzoni, 2017). Disentangling any established causal effect between interventions and outcomes in such situations is difficult.

Finally, as suggested by Battistin & Bertoni in Chap. 3, inferences on causal effects of policy interventions would require counterfactual procedures to assess what would have happened to the estimated outcomes had these interventions not taken place. Besides the difficulty of isolating a control group in social reality and introducing placebo-like neutral information in behavioral policies, endogenous social forces and processes cannot be suspended during a policy experiment. Treating data in a quasi-experimental way by randomization, instrumental variation and discontinuity design can increase the robustness of estimates, thus improving the internal and external validity of causal inferences. Here, we suggest a complementary strategy: the use of agent-based modeling (ABM) as *in silico* experiments accompanying, augmenting, or even substituting RCTs, whenever needed, in the traditional toolbox of the experimentalist policy analyst.

This policy function of ABM is key especially when: (a) there are no or insufficient empirical data on which to corroborate estimated causal relationships and perform *ex post*, counterfactual assessments; (b) the economic, social, or political costs of RCTs for policy appraisal or assessment are hardly sustainable; (c) 'social experimenters' are interested not only in estimating outcomes but also in understanding generative processes; (d) there is added value in exploring extreme, boundary, or counterfactual conditions that either do not exist in reality or have not yet occurred but in principle could. In all these cases, we argue that ABM is the only alternative to *ex post* observational analysis to explore and quantify hypothesized relationships between policy interventions and social outcomes. What is lost in terms of empirical realism is gained in terms of understanding the possible generative processes.

Reviews on recent applications of ABMs in various fields, from public health (Giabbanelli et al., 2021; Tracy et al., 2018) to agriculture (Kremmydas et al., 2018) and energy consumption (Klein et al., 2019), have shown that ABM is particularly suitable for providing insights into causal mechanisms, potentially linking interventions to outcomes. By generating "artificial data" via computer simulation, models can help to: (a) explore cases of multiple realizability (i.e., the same effect generated by different social causes and paths); (b) build 'what-if' scenario analysis that supports inferences about interventions-outcomes without impacting the targeted population; (c) estimate 'interference', network effects and spillovers of policy interventions (e.g., the situation in which one individual's exposure affects other individuals' outcomes); and (d) measure possibly multiple direct and indirect outcomes of the same intervention (Chalabi & Lorenc, 2013; Murray et al., 2021; Powell et al., 2017).

While most research has outlined the differences between ABM and more conventional policy approaches and methods, e.g., RCTs (e.g., Gilbert et al., 2018), here we would like to discuss complementarities and potential synergies between various experimental approaches. Indeed, as exemplified by Bravo et al. (2012), by using the computer as an 'artificial experimental environment', model parameters can be calibrated on existing individual (experimental) data to perform *in silico* counterfactual tests on any established causal relationship by quantifying the effect of varying initial conditions, especially those that could not be estimated empirically. What could happen to the observed causal relationship between *A* (intervention) and *B* (outcome), if certain hypothesized conditions *C* (either observable or not) were different? Why would *A* necessarily lead to *B* given that *C* may include adaptive, unpredictable individual behavior? As suggested by Manzo (2022), this is not only a problem of internal vs. external validity of estimated relationships (the effect of *A* on *B* would be contingent on a specific empirical instance with all due problems of generalization). It implies a search for causal or dependence relationships of interest not only within data but also via formalized models of "generative mechanisms" that consider mediating behavior and processes on which we might not have any data. Why and how, when exposed to *A* and under interaction effects that typically occur in social contexts, would individuals behave in such a way as to 'cause' the emergence of *B*?

The rest of the chapter is organized as follows: In Sect. 9.2, we provide a brief introduction to ABM, highlighting its specificity compared to other modeling approaches. In Sect. 9.3, we present some hypothetical policy cases on which the advantages of ABM can be understood. Model code is provided to help the reader to understand the potential of ABM for: (1) exploring the effect of parameter variations on the emergence of social outcomes; (2) building alternative scenarios to understand the effect of individual reactions on social outcomes. In Sect. 9.4, we summarize the main contributions of the chapter and discuss critical points and possible developments. Indeed, besides the (many) positive aspects, ABM also has certain weaknesses, including problems of model resolution, empirical validation, and external validity, which all require careful scrutiny.

#### **9.2 Agent-Based Modeling**

Agent-based modeling is a "computational method that enables a researcher to create, analyze, and experiment with models composed of agents that interact within an environment" (Gilbert, 2008). Agents may represent individuals, households, organizations, or any other entities, whose actions depend on conditional or stochastic decision-making rules (Bianchi & Squazzoni, 2015; de Marchi & Page, 2014; Macy & Willer, 2002; Tesfatsion & Judd, 2006). Agents can adapt their behavior in response to their own experience (e.g., learning), the interaction with other agents or in response to changes in the environment—e.g., policy interventions (Gilbert & Troitzsch, 2005; Squazzoni, 2012; Tracy et al., 2018).

As dynamic and process-based, ABMs are ideal to study the effects of complex interactions between micro- and macro-levels by exploring 'generative explanations' of social outcomes (Epstein, 2006; Hedström & Bearman, 2009; Macy & Flache, 2009). This is especially important in the case of complex adaptive social systems, whose stochastic, non-linear behavior can seldom be mathematically tractable and cannot be estimated deductively without computer simulation exploring various initial conditions and possible input/output paths (Miller & Page, 2009).

Unlike statistical models, which concentrate on relations between aggregate factors (Bianchi & Squazzoni, 2020), ABM starts from representing individual behavior and ends up exploring aggregate dynamics from agent interaction via computer simulation. Social regularities and patterns are neither derived by estimating the values of stochastic parameters that would maximize a model's fit to observed data, nor obtained by assumptions on aggregate properties that do not consider individual-level differences (e.g., Hedström & Manzo, 2015; Hedström & Udehn, 2009). ABM parameters are not estimated a posteriori; they are manipulated a priori following an experimental rather than an observational research design (Squazzoni, 2012).

Indeed, instead of being inferred from (or tested against) empirical data, the model allows us to explore hypothesized micro-social processes according to this Coleman-like connection: (a) initial macro parameter conditions → (b) heterogeneous individual behavior → (c) interaction effects → (d) social outcomes (Coleman, 1990). In line with the so-called 'analytical sociology' agenda (Hedström & Bearman, 2009; Hedström & Manzo, 2015; Manzo, 2022), ABMs can be viewed as generative models ensuring a high degree of internal validity regarding the "generative sufficient conditions" leading from (a) to (d) via the manipulation of (b) and (c) (Epstein, 2006). Unlike statistical models, generative explanations via ABM do not require the independence of observations, as they aim to explore systemic, interdependent social processes, i.e., specific configurations of (a), (b), and (c) that would determine (d). Furthermore, ABM allows us to explore various patterns of agent interaction directly within explicitly represented network structures (Macy & Flache, 2009).

While traditional equation-based models condense either a 'representative', collective agent or a homogeneous population into stochastic parameters (e.g., think about the modeling tradition in either standard economics or demography), ABM explicitly considers a population of heterogeneous, autonomous agents with different features and decision-making rules who interact either directly or indirectly while being exposed to various environmental stimuli, typically manipulated by the model maker (Gilbert, 2008; Macy & Flache, 2009; Macy & Willer, 2002; Squazzoni, 2012). By running experiments with human subjects, experimentalists aim to test theoretically deduced hypotheses on cause–effect relationships by manipulating the occurrence of an *explanans* (i.e., the treatment) in a randomized sample of individuals and studying the control vs. treatment group differences in the *explanandum*. In a similar fashion, an experimenter can use ABM to run several instances of a model by manipulating the *explanans* (i.e., changing the related model parameters) and then studying any differences in the simulated outcome. Instances could be designed as 'group-treatment' policy correlates; artificial agents (whose behavior could be empirically inferred from experimental data, if the ABM exercise is combined with a behavioral experiment, or theoretically postulated if data are not available) would be the correlates of experimental subjects; and their group-level reactions would be the outcome measurement. As such, the computer is used as an artificial laboratory where theoretically derived hypotheses are tested *in silico* by comparing a baseline (control group) initialization with manipulated scenarios (treatments) where the only difference is the introduction of a possible *explanans* (Squazzoni, 2012).
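To make the 'artificial laboratory' logic concrete, the following Python sketch compares a baseline with a treatment that differs in a single parameter. It is a toy model written for illustration only, not one of the models discussed in this chapter: the adoption rule, the `nudge` parameter, and all numerical values are hypothetical assumptions.

```python
import random
from statistics import mean

def run_model(n_agents, n_steps, nudge, seed):
    """Toy ABM (hypothetical): each agent adopts a behavior with a probability
    that grows with the current share of adopters (interaction effect) plus an
    optional policy nudge (the manipulated explanans)."""
    rng = random.Random(seed)
    adopted = [False] * n_agents
    for _ in range(n_steps):
        share = sum(adopted) / n_agents
        p = min(1.0, 0.01 + 0.05 * share + nudge)
        # One random draw per agent at every step, so baseline and treatment
        # runs with the same seed consume identical random numbers.
        adopted = [(rng.random() < p) or a for a in adopted]
    return sum(adopted) / n_agents

def experiment(nudge, n_runs=30):
    """Average final adoption share over repeated runs of one scenario."""
    return mean(run_model(200, 20, nudge, seed) for seed in range(n_runs))

baseline = experiment(nudge=0.0)    # control-group initialization
treatment = experiment(nudge=0.02)  # only the explanans differs
effect = treatment - baseline       # simulated treatment effect
```

Reusing the same seeds in both scenarios is a design choice of this sketch: it keeps the stochastic draws identical, so that any difference in the outcome can be attributed to the manipulated parameter.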

However, this does not constrain ABM to 'thought experiments' (Axelrod, 1997). Quantitative (e.g., population size, resources, network positions) and qualitative parameters (e.g., rules of behavior) related to (a), (b), and (c) can be calibrated according to empirical data (i.e., *empirical calibration*), and aggregate artificial outcomes (d) can be compared to empirical time series or distributions to adjudicate among potential configurations of (a), (b), and (c) those with higher explanatory power (i.e., *empirical validation*) (Boero & Squazzoni, 2005).

#### **9.3 Exploring Artificial Policy Scenarios**

In this section, we provide some abstract examples from our own research to illustrate the ABM approach to policy scenarios. Although there are many examples of concrete applications of ABM for policy interventions or design (e.g., Gilbert et al., 2018), here we have summarized two recent contributions that describe our idea of in silico experiments.

#### *9.3.1 Interventions to Increase Competition or Collaboration in Science*

Today, academic life is characterized by a "publish or perish" ethos and growing competition for funds and academic careers (Edwards & Siddhartha, 2017; Grimes et al., 2018). While competition is expected to stimulate the quality of publications, scientists must also collaborate, especially in reviewing manuscripts before publication, to defend robust academic standards of knowledge. This is the important function of "peer review": vetting scientific manuscripts submitted by authors for publication to a journal by voluntary collaboration of experts guided by journal editors. Unfortunately, research has shown that a lack of material incentives or a weak system of symbolic rewards can undermine peer review, as scientists would reduce the time and effort spent on reviewing (typically voluntary and not rewarded) to maximize their efforts in new publishable research, on which funds, prestige, and career depend.

Suppose that you are a policymaker who wants to test certain possible interventions to increase cooperation among scientists, but who also wants to ensure that this does not compromise the quality of publications. Here are two examples of possible research policy interventions. The first represents a policymaker wanting to strengthen quality signals of publication so as to induce scientists to compete for excellence, e.g., promoting only those scientists who publish in top journals. The second wants to reward peer reviewing by introducing an open science policy that would induce journals to shift from confidential to open peer review so that the identity of any reviewer is public, regardless of the final decisions on manuscripts. This would permit reviewers to claim their review as a reward. Note that even if abstract, both policy interventions are 'realistic': scientists are increasingly exposed to competitive rewards under the dominant rhetoric of excellence and comprehensive evaluation in almost all institutional contexts (e.g., Forsberg et al., 2022). In the second case, scientific associations and certain publishers have started to introduce open peer review policies as a means to recognize and reward reviewers (Bravo et al., 2019). Therefore, these examples are abstract (i.e., there is no 'real policy maker' commissioning a computational test of such policies) but not completely unrealistic (i.e., these interventions have been explored more locally and by trial and error).

Suppose we prepare a model to test these possible interventions. Assume a population of *n* agents representing a community of scientists. Assume that scientists are hired by academic organizations that periodically provide them with some minimal funding *R<sub>i</sub>* (e.g., laboratory equipment, access to online resources, etc.), allocated from a fixed overall amount of resources, *R* = ∑<sub>*i*</sub> *R<sub>i</sub>*. Assume that scientists are required to publish manuscripts to get more funds, reputation, prestige, and career, but that journals are competitive and so accept only a fixed proportion (*P*) of submitted manuscripts depending on a quality ranking determined by reviewers. Scientists then update their resource share according to their publication record as follows:

$$\mathcal{R}_i = \frac{p_i}{\sum_j p_j} \mathcal{R}$$

Suppose that, at each time step (*t*), scientists are required to perform two tasks, i.e., submitting their manuscripts to journals and reviewing manuscripts submitted by others (for the sake of simplicity, let us assume that each manuscript is submitted by only one author and is reviewed by only one reviewer; for a similar model, where we varied the number of reviewers, see Bianchi & Squazzoni, 2016). Assume that time is a scarce resource and both tasks are costly in that scientists need to decide how to allocate their resources between these two tasks.

Assume that the quality of submitted manuscripts (*Q<sub>i</sub><sup>s</sup>*) and review reports (*Q<sub>i</sub><sup>r</sup>*) depends linearly on the amount of resources allocated by scientists to these two tasks, as in:

$$\mathcal{Q}_i^{s} = e_i \mathcal{R}_i$$

$$\mathcal{Q}_i^{r} = \mathcal{R}_i - \mathcal{Q}_i^{s} = \left(1 - e_i\right) \mathcal{R}_i,$$

where *e<sub>i</sub>* is the share of resources that scientist *i* allocates to submitting rather than reviewing.
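The resource update and the split between submission and review quality can be written in a few lines of Python. This is only a minimal sketch of the three equations above: the function names are hypothetical, and the equal split used when nobody has published yet is an assumption that the text does not state.

```python
def update_resources(publications, total_resources):
    """R_i = p_i / sum_j(p_j) * R for every scientist i."""
    total_pubs = sum(publications)
    if total_pubs == 0:
        # ASSUMPTION (not in the text): equal shares before any publication.
        return [total_resources / len(publications)] * len(publications)
    return [total_resources * p / total_pubs for p in publications]

def split_quality(resources, effort):
    """Q_s = e_i * R_i and Q_r = R_i - Q_s = (1 - e_i) * R_i."""
    q_submit = effort * resources
    q_review = resources - q_submit
    return q_submit, q_review

# Three scientists with 3, 1 and 0 publications share R = 100.
shares = update_resources([3, 1, 0], total_resources=100.0)
# The first scientist puts 75% of her resources into her own manuscript.
q_submit, q_review = split_quality(shares[0], effort=0.75)
```

Note that, under this rule, the two qualities always add up to the scientist's resources, which is what makes submitting and reviewing competing uses of the same budget.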

Following Squazzoni and Gandelli (2012, 2013), we assume that reviews may be biased, so the actual quality of manuscripts could be only approximated by the reviewer depending on the level of resources individually invested by the scientist in reviewing (higher investment = more precise evaluation of the quality of manuscripts), as follows:

$$
\hat{\mathcal{Q}}_i^{s} = \alpha_j \mathcal{Q}_i^{s},
$$

with *α<sub>j</sub>* being drawn from a normal distribution whose parameters depend on the reviewer's investment *Q<sub>j</sub><sup>r</sup>*, capped at *T*<sup>∗</sup>, where *j* is the reviewer and *T*<sup>∗</sup> is a quality threshold which estimates the minimum amount of resources needed by each *j* to provide a fair review.
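The idea that more investment in reviewing yields a more precise evaluation can be illustrated with a short Python sketch. The functional form used here is an assumption made purely for illustration, not the authors' specification: the multiplier is drawn from a normal distribution with mean 1 and a standard deviation that shrinks linearly to zero as the reviewer's investment reaches the threshold. Function and argument names are hypothetical.

```python
import random

def review_noise_sd(q_review, threshold):
    """ASSUMED noise level: 1 when nothing is invested in reviewing,
    0 once the investment reaches the threshold T*."""
    return 1.0 - min(q_review, threshold) / threshold

def perceived_quality(q_submit, q_review, threshold, rng):
    """Quality of author i's manuscript as estimated by reviewer j:
    true quality times a noisy multiplier alpha_j centered on 1."""
    alpha = rng.gauss(1.0, review_noise_sd(q_review, threshold))
    return alpha * q_submit

rng = random.Random(42)
# A reviewer investing at the threshold evaluates the manuscript exactly.
exact = perceived_quality(10.0, q_review=5.0, threshold=5.0, rng=rng)
# A reviewer investing less returns a noisy estimate of the same manuscript.
noisy = perceived_quality(10.0, q_review=1.0, threshold=5.0, rng=rng)
```

Under this assumed form, investment beyond the threshold brings no further gain in precision, which mirrors the role of *T*<sup>∗</sup> as the minimum needed for a fair review.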

Suppose that the quality of manuscripts can be unequivocally quantified so that manuscripts can be compared and ranked by journals for publication. Suppose we do not consider the role of editors, the presence of multiple journals, the possibility of resubmitting rejected manuscripts and other 'realistic' conditions. Let us consider these factors as irrelevant here (see the pseudo-algorithm describing the model in Table 9.1).

**Table 9.1** Pseudo-code of the model (for more detail, see Bianchi et al., 2018)

Let us next run our simulations for a sufficient number of iterations (in our case, *m* = 1500) to reach a stable outcome equilibrium, repeating our simulations at least 100 times for each initialization, and measure the outcomes as follows: (1) publication bias (i.e., the proportion of incorrectly rejected submissions out of the total number of published articles); (2) the average quality of publications; (3) the average quality of the ten top-quality articles. All measurements are taken at each time step and so can be averaged at the end of each simulation (see the model parameters in Table 9.2).
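The three outcome measures can be computed for a single time step as in the sketch below. It is an illustrative reading of the definitions above rather than the authors' code: a submission is treated as 'incorrectly rejected' when its true quality would have placed it among the accepted share but its reviewed (perceived) quality did not. All names are hypothetical.

```python
from statistics import mean

def top_indices(values, k):
    """Indices of the k largest values (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return set(order[:k])

def publication_outcomes(true_quality, perceived_quality, accept_share, top_k=10):
    """Outcomes (1)-(3) for one time step of the simulation."""
    n_accept = round(accept_share * len(true_quality))
    if n_accept == 0:
        raise ValueError("accept_share too small: nothing is published")
    published = top_indices(perceived_quality, n_accept)  # journal ranking
    deserving = top_indices(true_quality, n_accept)       # error-free ranking
    wrongly_rejected = deserving - published
    published_quality = sorted((true_quality[i] for i in published), reverse=True)
    return {
        "publication_bias": len(wrongly_rejected) / n_accept,
        "mean_quality": mean(published_quality),
        "top_quality": mean(published_quality[:top_k]),
    }

# Four manuscripts, half are accepted; the reviewer underrates manuscript 1.
outcomes = publication_outcomes([5, 4, 3, 2], [5, 1, 3, 4], accept_share=0.5, top_k=1)
```

In this small example, one of the two manuscripts that deserved publication is rejected, so the publication bias is one half.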

#### **9.3.1.1 Example 1**

Let us now suppose that we want to explore a set of potential interventions to stimulate scientists to increase the quality of their publications (outcome (2)) while, at the same time, minimizing publication bias at the system level (outcome (1)). For instance, the policymaker could set up rewards or prizes for this purpose but would like to estimate the mediating effect of scientists' possible reactions. You could create two 'treatment scenarios': one in which rewards point to strong competition and excellence, e.g., scientists are induced to compare their *Q<sub>i</sub><sup>s</sup>* (regardless of whether their submission was published or rejected) with the top ten publications (we called it "high competition"), and another in which rewards point to average quality, e.g., scientists use the average quality of below-median published articles as a comparison (we called it "minimum expected quality"). In both scenarios, suppose that these comparisons would determine an individual binary satisfaction value, which would make scientists revise their resource allocation decisions between investing more either in their own manuscripts or in reviewing other manuscripts.

**Table 9.2** Example 1: Model parameters

Adapted from Bianchi et al. (2018)

Now, let us hypothesize three possible decisions made by scientists: (1) always selfishly investing in their own publication at the expense of peer reviewing, (2) investing more in reviewing when their manuscripts have been previously rejected, and (3) investing more in reviewing when their manuscripts have been previously published. Let us then add a control factor: a level of subjective overconfidence when scientists compare the quality of their own manuscripts with current publications by others. This can be done by re-running all the same simulation scenarios under two further conditions: all scenarios initialized with 'objective' quality comparisons vs. all scenarios with 'subjective' quality comparisons. This factorial design would imply measuring the same outcomes. Then, let us suppose that you create an artificial 'control group' where you remove any comparison, so that scientists follow their allocation strategies without any intervention regarding 'excellent' or 'minimum expected quality' signals.
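The resulting factorial design can be enumerated explicitly, which helps to check that no cell is forgotten before launching the simulations. The Python sketch below is only an illustration: the labels are hypothetical shorthand for the conditions described above, and it assumes that the control group is crossed with the three allocation strategies only.

```python
from itertools import product

# Hypothetical labels for the experimental factors described in the text.
signals = ["high competition", "minimum expected quality"]
strategies = [
    "never invest in reviewing",
    "review more after rejection",
    "review more after publication",
]
comparisons = ["objective", "subjective"]

# Treatments: every signal x strategy x comparison combination.
treatments = [
    {"signal": s, "strategy": b, "comparison": c}
    for s, b, c in product(signals, strategies, comparisons)
]
# Control group (assumption): no quality signal, hence no comparison at all.
controls = [{"signal": None, "strategy": b, "comparison": None} for b in strategies]

design = controls + treatments
```

Each entry of `design` would then be run for the chosen number of repetitions, with the same outcome measures collected in every cell.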

We calculated cumulative moving average values of our outcomes over the last 100 steps of each iteration and the mean value of outcome measurements for each scenario. Table 9.3 shows the first outcome (1), i.e., publication bias, when scientists were induced to compete for excellence or looked at minimum expected quality and adapted their allocation strategies accordingly. Compare these outcomes with the control group. Adding rewards for excellence produced higher publication bias than 'minimum expected quality' signals. However, outcomes vary greatly depending on the scientists' adaptive reactions. Note that reviewing only after being published, i.e., a reciprocal behavior, without considering any comparison of quality was detrimental to publication bias. Furthermore, counterintuitively, overconfidence had a positive effect in both scenarios, especially in the high competition scenario (29.47%), where publication bias decreased even below the outcome of the 'control group' scenario (32.71%). Therefore, results suggest that publication bias was higher under stronger competition but that the precise effects depended on various behavioral factors.

**Table 9.3** Evaluation bias (%) in different scenarios. (Moving average values over 100 repetitions)

Adapted from Bianchi et al. (2018)

**Table 9.4** Average published quality in different scenarios. (Moving average values over 100 repetitions, then normalized 0–1)

Adapted from Bianchi et al. (2018)
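The summary statistic reported in these tables, a cumulative moving average over the final steps of each run that is then averaged across repetitions, can be sketched as follows. The function names and the toy series are hypothetical; only the 100-step window comes from the text.

```python
from itertools import accumulate
from statistics import mean

def cumulative_moving_average(series):
    """Running mean of the series after each new observation."""
    return [total / k for k, total in enumerate(accumulate(series), start=1)]

def run_summary(series, window=100):
    """Final cumulative moving average over the last `window` steps of one run."""
    return cumulative_moving_average(series[-window:])[-1]

def scenario_summary(runs, window=100):
    """Mean of the per-run summaries across all repetitions of a scenario."""
    return mean(run_summary(series, window) for series in runs)

# Two toy runs of ten steps each, summarized over their last four steps.
toy_runs = [list(range(10)), [2.0] * 10]
toy_value = scenario_summary(toy_runs, window=4)
```

Restricting the average to the last steps of each run discards the initial transient, so that the reported value reflects the stable part of the simulation.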

Turning to the second outcome of interest (2), i.e., the average quality of publications, results did not vary in the same way as for the first outcome, i.e., publication bias. The highest value was achieved when scientists were induced to compete for excellence and reciprocated higher investment in reviewing whenever previously published (see Table 9.4). This was confirmed when considering the quality of the top ten published articles across different scenarios (see Table 9.5). In conclusion: (a) policy interventions that increase the competitive spirit of scientists towards publications could backfire if norms of peer reviewing cannot be enforced; (b) even a minimal level of overconfidence can determine positive or negative outcomes compared to more objective self-evaluation (for detail, see Bianchi et al., 2018).


**Table 9.5** Average publication quality of top ten published papers across different institutional settings and behavioral strategies. (Moving average values over 100 repetitions, then normalized 0–1)

Adapted from Bianchi et al. (2018)

#### **9.3.1.2 Example 2**

Now let us suppose that we would like to manipulate the peer-review policy adopted by journals, testing the effect of shifting from confidential to open peer review in situations in which scientists would be sensitive to competition and status when reviewing others' manuscripts. Under confidential peer review, authors and reviewers do not know each other's identity and so they could just react to their own rejections by reducing their effort *e<sub>i</sub>* in reviewing to punish the system which did not favor them. Under open peer review, author and reviewer identities are disclosed and so scientists could reciprocate positive or negative editorial decisions by adapting *Q̂<sub>i</sub><sup>s</sup>* once they are later matched by the journal. Note that the sensitivity of scientists to this shift of the peer review model has been found in some recent 'quasi-experimental' analysis (e.g., Bravo et al., 2019). Do the benefits of open peer review come at the price of increasing publication bias, if scientists can react to status and competition and use peer review to either help favorable or punish unfavorable authors who previously reviewed their own manuscripts? Can we ideally quantify how much that price would be?

Table 9.6 shows the initial parameters of this model. We tested various possible behaviors with a focus on reviewing (e.g., always being fair, being randomly reliable, deciding how much to invest in reviewing depending on previous rejection or acceptance of their manuscript). Here, we concentrated on comparing different reviewers' reactions to previous experience as authors in two journal settings: (1) journals following confidential peer review, in which reviewers invest in reviewing whenever previously published or otherwise disinvest, so providing unreliable reports; (2) journals following open peer review, in which reviewers' and authors' identities are revealed and reviewers reciprocate positive reviews to authors who previously favored them as reviewers, and negative reviews to previously unfavorable reviewers.
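The two reciprocity rules can be expressed as simple decision functions. The Python sketch below is only a schematic reading of the two journal settings, not the authors' implementation; the names are hypothetical, and the choice of a fair review when reviewer and author share no history is an assumption.

```python
from enum import Enum

class Review(Enum):
    FAIR = "fair"
    UNRELIABLE = "unreliable"
    FAVORABLE = "favorable"
    UNFAVORABLE = "unfavorable"

def confidential_review(reviewer_was_published):
    """Indirect reciprocity: identities are hidden, so reviewers react only
    to the fate of their own previous submission."""
    return Review.FAIR if reviewer_was_published else Review.UNRELIABLE

def open_review(author, past_verdicts):
    """Direct reciprocity: `past_verdicts` maps the identity of a former
    reviewer to True if that person favored the current reviewer's manuscript."""
    verdict = past_verdicts.get(author)
    if verdict is None:
        return Review.FAIR  # ASSUMPTION: no shared history means a fair review
    return Review.FAVORABLE if verdict else Review.UNFAVORABLE

# Hypothetical history: Ada once favored this reviewer's manuscript, Bo did not.
history = {"Ada": True, "Bo": False}
reviews = [open_review(name, history) for name in ("Ada", "Bo", "Cy")]
```

The key difference between the two settings is the information available to the reviewer: under confidential review the reaction is aimed at the system as a whole, whereas under open review it can be targeted at a specific author.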

Adapted from Bianchi and Squazzoni (2022)

Figure 9.1 shows the first outcome of interest (1), i.e., publication bias, when journals follow confidential peer review and reviewers are either always fair, always unreliable, or sensitive to previous experiences as authors (e.g., being fair when previously treated fairly, being unfair when previously being treated unfairly). If reviewers react to previous experience, the level of bias approximates a random situation in which the publication of manuscripts could be decided by editors tossing a coin. Let us use these outcomes as a baseline to compare the effect of reciprocity strategies in the two peer review settings.

**Fig. 9.1** The impact of reviewer behavior on publication bias in confidential peer review. Circles: fair; squares: unfair; triangles: reactive. Values averaged over 200 realizations. (Source: Bianchi & Squazzoni, 2022)

Figure 9.2 shows the first outcome of interest (1), i.e., publication bias, when comparing reciprocal strategies in the two peer review settings. Publication bias increased more than 20% under open peer review and added an extra 20% of bias compared to a situation where editorial decisions would be random. This would suggest that open peer review could be detrimental whenever we assume that reviewers are sensitive to cooperation signals. Further results (reported in Bianchi & Squazzoni, 2022) indicate that even if reviewers would retaliate only against previous reviewers of lower academic status (i.e., with lower resources compared to theirs) while being fair in case previous unfavorable reviewers were scientists of higher status, the effect on the outcome would differ only minimally (differences not higher than 5% on the level of publication bias).

**Fig. 9.2** The impact of scientists' reciprocity strategy on publication bias in confidential vs. open peer review. Triangles: indirect reciprocity (confidential peer review); circles: direct reciprocity (open peer review). Values averaged over 200 realizations. (Source: Bianchi & Squazzoni, 2022)

Figure 9.3 shows the effect of reviewer behavior on the second outcome (2), i.e., the average quality of publications. Open peer review would determine the lowest quality of publications even when compared to random editorial decisions. Note that we tested the sensitivity of these outcomes to the variation of all initial parameters and findings were confirmed (see the Supplementary Material of Bianchi & Squazzoni, 2022). In conclusion, this exercise would suggest that if practices and norms exist that make scientists frame peer review as a signaling game, open peer review policies, once adopted globally, could increase publication bias by more than 20% compared to confidential peer review, thus compromising publication quality. Obviously, other computational tests could also be designed with the model by considering, for example, other factors, being more nuanced, and considering empirically grounded behavior. Although a more realistic and empirically calibrated parameterization of the model would be important, as suggested by Feliciani et al. (2019) in their overview of computer simulation research on peer review, the cases presented here were only aimed at exemplifying a method to test policy interventions artificially.

**Fig. 9.3** The impact of reviewer behavior on the average quality of published papers under different peer review models. In the rectangle: comparison between reciprocity strategy in confidential (black) vs. open peer review (white). Values averaged over 200 realizations. (Source: Bianchi & Squazzoni, 2022)

#### **9.4 Conclusions**

In this chapter, we have presented ABM as a method to perform computational experimental tests on non-linear, complex effects of policy interventions as these can determine interaction effects and individual adaptations. This could enlarge the toolbox of experimental policy analysts, especially when RCTs cannot be designed due to various ethical, political, or economic constraints. *In silico* tests are also required before policy design to explore potential unintended consequences or when an understanding of social processes could provide relevant insights to enhance comprehensive policy appraisal. In our view, ABM can fruitfully complement, enrich, and even substitute, when necessary, more conventional behavioral methods for public policy.

However, the use of ABM also has important limitations. As discussed by Gilbert et al. (2018) in a comprehensive review of practices of computational modeling of public policy, deciding the appropriate model resolution requires critical decisions.

Besides the hypothetical exercises presented here, where we have proposed abstract examples, in concrete contexts, the optimal level of abstraction of a model depends on the purpose of modeling and the nature of the system being modeled (Edmonds et al., 2019). For instance, during the COVID-19 pandemic, epidemiologists have used ABMs to simulate a variety of anti-contagion policies to flatten the curve by reaching an appropriate level of resolution on certain parameters (e.g., population size). However, they followed empirically implausible assumptions on relevant others (e.g., social networks and externalities), which compromised a more comprehensive exploration of possible policy interventions while downplaying the fundamental role of uncertainty (see Squazzoni et al., 2020 for a critical overview; for an example of empirical calibration of networks in epidemiological models, see Manzo & van der Rijt, 2020).

This raises two interrelated challenges in the use of ABM for public policy, i.e., the use of empirical data to calibrate model parameters via existing or *ad hoc* data, and the heuristic value of model findings to inform policy interventions or policy evaluation (Tracy et al., 2018). In this regard, as suggested by Murray, Marshall & Buchanan (2021, 1655) in their proposed 'target trial framework', whenever combined with the usual experimental framework of behavioral policy, ABM could incorporate empirical data on the targeted population (e.g., calibrating salient characteristics of individuals from available data sources) and a detailed and explicit specification of the hypothetical trial, while using the in silico experimental nature of these models as an 'artificial world' "with no ethical, logistical, or financial constraints, and in which the exposure of interest is perfectly manipulable by study investigators, regardless of whether this is actually feasible or ethical in the real world." This would help to fill the gap between empirical data and unobservable variables and inform study design. Furthermore, following Bravo et al. (2012), calibrating ABM with results from small-scale pilots, RCTs or well-detailed observational studies or re-running existing trials in a model, while scaling the characteristics of the original target population to populations with other characteristics or testing other network structures compared to those originally reproduced in the previous study, could help us to increase generalization or perform counterfactual tests of policy findings. This would help to assess the dependence of outcomes on contextual details and help us understand how much causal inference exercises on complex social behavior require careful examination.

#### **Suggested Readings**


Page, S. E. (2018). *The Model Thinker*. New York, NY: Basic Books.

#### **Review Questions**


#### **Replication Material**

The models have been built in NetLogo. The code is available at the following links:


#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 10 The Many Threats from Mechanistic Heterogeneity That Can Spoil Multimethod Research**

**Markus B. Siewert and Derek Beach**

**Abstract** The combination of cross-case and within-case analysis in Multi-Method Research (MMR) designs has gained considerable traction in the social sciences over the last decade. One reason for the popularity of MMR is grounded in the idea that different methods can complement each other, in the sense that the strengths of one method can compensate for the blind spots and weaknesses of another and vice versa. In this chapter, we critically address this core premise of MMR with an emphasis on the external validity of applying some cross-case method, like standard regression or Qualitative Comparative Analysis, in combination with case study analysis. After a brief overview of the rationale of MMR, we discuss in detail the problem of deriving generalizable claims about mechanisms in research contexts that likely exhibit mechanistic heterogeneity. In doing so, we clarify what we mean by mechanistic heterogeneity and where researchers should look for potential sources of mechanistic heterogeneity. Finally, we propose a strategy for progressively updating our confidence in the external validity of claims about causal mechanisms through the strategic selection of cases for within-case analysis based on the diversity of the population.

#### **Learning Objectives**

By studying this chapter, you should be able to:


M. B. Siewert
Munich School of Public Policy, Technical University of Munich, Munich, Germany e-mail: markus.siewert@hfp.tum.de

D. Beach (\*) Aarhus University, Aarhus, Denmark e-mail: derek@ps.au.dk


#### **10.1 Introduction**

Over the last decades, multimethod research (MMR) has gained considerable popularity in the analysis of public policy (see Fielding, 2010; Hendren et al., 2018; Wolf, 2010 for overviews about MMR studies in public policy), echoing a general trend in the political and social sciences (seminally, Lieberman, 2005; for up-to-date discussions, see Beach & Kaas, 2020; Goertz, 2017; Humphreys & Jacobs, 2015; Seawright, 2016). Many texts define MMR as any research design which uses two or more methods to analyze the same research topic, often involving cross-case analysis of patterns of association between causes and outcomes and within-case analysis of how the causal linkage(s) work (see Creswell & Plano Clark, 2018; Schoonenboom & Burke Johnson, 2017; Tashakkori & Teddlie, 2021 for various definitions).

The most common type of MMR in political science involves the combination of some form of cross-case analysis, e.g., using regression-based methods (see Chaps. 4 and 5), or some variant of mediation analysis (see Chap. 6) or Qualitative Comparative Analysis (see Chap. 7), and one or several within-case studies using methods like congruence analysis or process tracing (see Chap. 8).<sup>1</sup> The cross-case analysis enables the identification of the net causal effects or invariant association between X and Y, i.e., does X make a difference for Y? The within-case analysis, on the other hand, focuses on the causal linkage *aka* mechanism(s), i.e., how does X work to bring about Y? The core logic behind this variant of MMR, in a nutshell, is that combining methods that allow for different kinds of inferences bears the potential to use the particular strengths of one technique to cancel out the other's weaknesses, and vice versa (e.g., Beach, 2020, 163; Clarke et al., 2014, 341; Goertz, 2017, 5–6; Lieberman, 2005, 436; Weller & Barnes, 2016, 426–27). In doing so, the promise of MMR is that its design ultimately yields more robust inferences by shedding light on social phenomena or substantiating our understanding of policy problems from different analytical perspectives.

<sup>1</sup> In this chapter, we deliberately leave aside the question of MMR using interpretative techniques. Irrespective of their many merits, interpretative techniques concentrate on research themes that fundamentally differ from the type of causal questions addressed in this chapter. Hence, we remain within the broad ontological assumption that causation exists in the form of causal effects/invariant associations and causal mechanisms and that they can be examined empirically (see Chap. 2) – a thread that connects all contributions in this volume. For recent developments in interpretative methods, see Schwartz-Shea and Yanow (2012).

The question of whether MMR can deliver on this promise – whether different methods can efficiently complement each other and strengthen overall causal inferences "because taken on their own each sort of evidence has significant limitations" (Clarke et al., 2014, 341) – has not gone uncontested. In fact, there is a notable strand within the methodological literature reflecting upon the notion of mutual complementarity in MMR. The core of this debate deals with whether different methods that make different types of causal claims and use different types of evidence can really be merged as seamlessly as is frequently portrayed (Beach & Kaas, 2020). Among other things, it has been highlighted that MMR can involve the problem of conceptual stretching or might even introduce conceptual incongruity if specific causal properties are added/dropped from concepts when moving between the cross-case and the within-case level of an analysis (Ahmed & Sil, 2009; Ahram, 2013). Similarly, while case studies can be used to check for measurement errors or to develop context-sensitive indicators (e.g., Seawright, 2016, 50–53), it can be that translating within-case observations into comparable cross-case data, and the other way around, is neither intuitive nor straightforward (Ahram, 2013; Kuehn & Rohlfing, 2009). Finally, it has been frequently mentioned that case studies can be used in MMR to check for under- and/or overspecification of the explanatory model at the cross-case level (Lieberman, 2005; Seawright, 2016, 67–74). Yet, Rohlfing (2008) convincingly shows that model misspecifications can travel between different levels of analysis because residuals and effect sizes might point towards the wrong cases for further within-case study, hence aggravating the situation, since an incorrect model is corroborated by looking at the wrong cases. In short, numerous pitfalls can complicate the effective integration of different approaches and methods in MMR designs.

This chapter concentrates on another significant problem: How can insights about causal mechanisms gained by studying how they work in one case be generalized to cases that we have not studied using case studies but look similar at the cross-case level? The issue of generalization has so far largely been ignored in the political science literature on MMR. As we will show below, generalizing about mechanisms is particularly difficult in settings that exhibit mechanistic heterogeneity. We define *mechanistic heterogeneity* as a scenario where multiple different mechanisms link the same explanatory factor(s) X to the same outcome Y (Álamos-Concha et al., 2021; Beach et al., 2019). For instance, we might find out that epistemic authorities (*aka* experts) gained influence over a policy in one case through a mechanism involving a process where the experts gained access to decision-makers by joining the bureaucracy itself (Löblová, 2018). However, in another case, influence might have been achieved through other processes, such as experts or lobbies' framing of the debates from the outside.

This form of heterogeneity and complexity at the level of mechanisms is widely discussed in the literature on case-based methodology (Beach & Pedersen, 2016, 2019; Bennett & Checkel, 2015; Blatter & Haverland, 2012; Falleti & Lynch, 2009; George & Bennett, 2005; Rohlfing, 2012). However, it is largely neglected in most accounts that deal with the integration of cross-case and within-case analysis (but see Beach et al., 2019; Goertz, 2017; Weller & Barnes, 2016), which is why we do not yet have a good understanding of how to deal with the issue of making cross-case and within-case analysis communicate in MMR. To put it simply, the cross-case analysis tells us about differences and similarities at the level of X's and Y's; in contrast, the within-case analysis tells us about linkages (if any) between X and Y. In fact, we are making different types of causal claims, using very different types of empirical material (Clarke et al., 2014).

Addressing this question in the context of a volume on causation in policy studies is important for several reasons. First, we can observe an apparent 'mechanistic turn' in the social sciences which gradually expands across its subfields, including the field of public policy analysis (e.g., Capano et al., 2019; Capano & Howlett, 2021; Fontaine, 2020; Kay & Baker, 2015; Lindquist & Wellstead, 2019; van der Heijden et al., 2019). For instance, Fontaine (2020, 274) stresses that there is an emerging consensus on the fact that producing evidence about mechanisms via process tracing bears a significant "potential contribution to comparative policy analysis." Capano and Howlett (2021, 142 italics in the original) go one step further, arguing that "[p]olicy-makers [..] need a realistic causal theory about what occurs when policy tools are deployed and how it occurs if they want to design something that will actually happen more often than not, and to escape the trap of poorly conceived and related tacit knowledge, experience, and heuristics." Yet, secondly, if we accept that producing comprehensive causal explanations requires both robust evidence that a probable cause X is correlated/associated with Y as well as sound evidence for the causal mechanisms linking X and Y, the ability to generalize mechanistic claims from one studied case to other cases belonging to the same population becomes a significant issue. In one case study, we might have found that the linkage worked in one way, but how would we know whether the linkage (if any) is similar in other cases if we have not also investigated them? For instance, can we assume that a particular strategy used by a political entrepreneur that worked during a crisis would work in other situations? Assuming that mechanisms work in similar ways in other, non-studied cases is in effect generalizing based on hope instead of evidence.
If researchers and policymakers need to know what works, how, and under what conditions, a well-informed mapping of the underlying mechanisms operative within a population of cases is crucial to generalize how X and Y are linked in different cases within a population.

The chapter is structured as follows: Section 10.2 outlines the basic ideas behind MMR designs, introduces the main templates, and discusses key ontological and epistemological differences when combining cross-case and within-case analysis. Section 10.3 addresses the problem of mechanistic heterogeneity by illustrating what heterogeneity at the level of mechanisms means. After that, Sect. 10.4 presents a selected set of potential sources to which researchers should turn to check for mechanistic heterogeneity in MMR. In Sect. 10.5, we discuss a stepwise generalization strategy that is sensitive to mechanistic heterogeneity and whose primary goal is to progressively update the confidence in the external validity of mechanisms by gradually expanding the knowledge about how mechanisms work in different (sets of) cases. The chapter closes with some final remarks.

#### **10.2 Basic Ideas Behind MMR**

The main rationale behind combining cross-case and within-case methods in MMR is that it allows researchers to make different types of causal inferences (e.g., Beach, 2020; Beach & Rohlfing, 2018; Goertz, 2017; Lieberman, 2005; Rohlfing & Schneider, 2018; Seawright, 2016; Weller & Barnes, 2016). On the one hand, cross-case analyses are particularly good at identifying cause–effect relationships by examining regular associations in the form of controlled experiments, correlations, or set-relations across a sample of cases. On the other hand, within-case analyses can establish the causal linkages between one or several causes and the respective contributions by tracing the underlying causal mechanism(s). By integrating both analytic perspectives and using methods in combination to address a shared research theme, it is argued that one can strengthen the soundness and robustness of the inferences since each mode of analysis has particular strengths that can make up for the other's blind spots (Cartwright, 2011; Clarke et al., 2014; Steel, 2008).

But how does this division of labor work in research practice? The literature on MMR has produced numerous taxonomies and typologies of different designs (see Bryman, 2006; Creswell & Plano Clark, 2018; Schoonenboom & Burke Johnson, 2017; Tashakkori & Teddlie, 2021, among others). One common defining element is whether the methods are applied in parallel or sequentially. In *parallel designs*, two or more methods are applied simultaneously; in *sequential designs*, one is used after the other. A different feature is whether the parts of an MMR study depend on each other or are performed independently. In the former scenario, insights from one study inform the data collection and/or analysis of the other; in the latter scenario, data collection and/or analysis are performed separately within each method.

The sequential research strategy is probably the most common in political science research. Two variants are typically distinguished (e.g., Beach & Rohlfing, 2018, 11–18; Lieberman, 2005; Rohlfing, 2008; Rohlfing & Schneider, 2018, 44–45; Seawright, 2016). In 'cross-case first/within-case second' designs, the researcher starts with some form of cross-case analysis to identify robust connections between a (set of) explanatory factor(s) X and an outcome of interest Y. This is followed by one or several case studies based on the findings of the first analytic step. On the other hand, 'within-case first/cross-case second' designs follow the opposite logic. Here, the analysis starts at the within-case level to uncover some causal connection and/or mechanisms and then continues with the cross-case analysis to explore whether the identified relationship also holds across a population of cases.

While one of the original motivations behind the methodological work on MMR was to (at least partially) overcome the divide between qualitative and quantitative methods, recent debates have again emphasized the ontological and epistemological differences between research approaches and the challenges they create for integrating methods from the different cultures into an (at least somewhat coherent) MMR design. At least two types of approaches can be differentiated: variance-based and case-based (for the following, see Beach & Kaas, 2020).

Variance-based approaches to MMR build on a counterfactual understanding of causation as developed in the Potential Outcome framework. Counterfactual causation is defined as the claim that a cause produced an outcome because its absence would result in the absence of the outcome, all other things being held equal. Without evaluating the difference that a cause can make between the actual and the counterfactual, no causal inference is possible. Therefore, the main causal inference is established at the cross-case level using controlled comparisons. To put it more bluntly, the cross-case method is in the inferential driver's seat, while the within-case analysis serves as an adjunct method.<sup>2</sup> This does not mean that the within-case study is not important. It fulfills crucial functions such as validating measurement, establishing a case's counterfactual, reconstructing the causal pathways, or searching for confounders (Seawright, 2016; Weller & Barnes, 2016). Causal evidence, however, lies across cases.
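The counterfactual definition can be written compactly in standard potential-outcomes notation. The symbols below are the conventional ones of that framework and are introduced here only for illustration; they are not notation used elsewhere in this chapter. For a case $i$, let $Y_i(1)$ be the outcome with the cause present and $Y_i(0)$ the outcome with the cause absent, all else being equal.

```latex
% Case-level causal effect: actual vs. counterfactual outcome of the same case
\tau_i = Y_i(1) - Y_i(0)

% Average effect across a population of cases
\mathrm{ATE} = \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]
             = \mathbb{E}\left[ Y_i(1) \right] - \mathbb{E}\left[ Y_i(0) \right]
```

Because only one of $Y_i(1)$ and $Y_i(0)$ can ever be observed for a given case, $\tau_i$ is not directly observable, whereas the average can be estimated by comparing suitably controlled groups of cases. This is one way to read the claim that, in this tradition, causal evidence lies across cases.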

In case-based approaches to MMR, multiple understandings of causation exist side-by-side (Baumgartner & Falk, 2019; Beach & Pedersen, 2019; Rohlfing & Schneider, 2018). They usually have in common that the inferential workhorse in MMR designs is located at the within-case level instead. To establish a causal relationship, it must be checked whether the identified explanatory factors indeed exert some causal power over the outcome in a case, and if so, how exactly the causal mechanism plays out (e.g., Beach et al., 2019; Schneider & Rohlfing, 2016, 2019). Here, the analysis at the cross-case level plays an adjunct role, e.g., by establishing an X/Y relation in the first place, guiding the case selection for the within-case study, or mapping the population of cases for further generalization (Box 10.1).

#### **Box 10.1: The Variance-Based and the Case-Based Approach to MMR**

The question of variance-based and case-based approaches to MMR needs to be located in the broader discussions within the philosophy of science (e.g., Cartwright, 2011; Russo & Williamson, 2011) and political science. In this sense, it connects to seminal readings like King et al. (1994), which argued in favor of a shared understanding of causal inferences across quantitative and qualitative (i.e., empirically oriented case-based) methods. This has been challenged in recent debates, which (again) point out the ontological and epistemological differences between the qualitative and quantitative methods (Brady & Collier, 2010). Consequently, there has been a rise of methodological guidelines for different MMR designs depending on the research tradition in which they are grounded (see Beach & Kaas, 2020 for an overview).

Variance-based approaches to MMR (e.g., Lieberman, 2005; Seawright, 2016; Weller & Barnes, 2014, 2016), as pointed out in the main text, are usually grounded in the potential-outcomes framework (*aka* counterfactual causation). They apply a top-down perspective where the main goal is to identify robust causal effects in a population of cases, or a sample thereof (with randomized controlled trials as a gold standard). This is followed by an assessment at the within-case level of whether the causal relationship holds or not. The cross-case analysis using controlled comparisons is the main workhorse for causal inference, focusing on difference-making. To align cross-case and within-case analysis, variance-based approaches often understand causal mechanisms as intervening variables whose difference-making can be assessed using controlled comparisons between cases.

<sup>2</sup> It is important to note that there are alternative proposals. For instance, Runhardt (2015, 2021) envisages a design where controlled comparisons are used at the within-case level where two or more cases are examined to see whether the proposed mechanism made a difference.

For case-based approaches to MMR, the ontological underpinnings are varied, relying on regularity theory (e.g., QCA) or mechanisms (e.g., process tracing) (Beach & Rohlfing, 2018; Goertz, 2017; Rohlfing & Schneider, 2018; Schneider & Rohlfing, 2016; see also Chaps. 1, 2, 6, and 7). However, what is shared by all existing frameworks is that the main causal inference happens at the within-case level through case study methods like process tracing. In this regard, case-based approaches are bottom-up in their focus on causation as it plays out within single cases, after which generalizations might be made to other cases. As regards the understanding of causal mechanisms, there is an emerging consensus on a productive account of mechanisms – which we also subscribe to in this chapter – that understands mechanisms in the form of actors engaging in activities that link a cause and outcome together in a productive causal relationship. Nevertheless, epistemological discussions are still ongoing about how to identify the working of mechanisms (see also Chaps. 2, 6 and 8).

#### **10.3 The Problem of Mechanistic Heterogeneity for External Validity in MMR**

Making generalizations about the working of mechanisms from one studied case to other cases which are not studied is a crucial problem in the social sciences and beyond (e.g., Cartwright, 2011; Khosrowi, 2019; Steel, 2008; Wilde & Parkkinen, 2019). Knowing how a policy intervention works in one case does not necessarily tell us how it would work in other, non-studied cases.

The relevance of this issue is evident in case-based approaches, where the examination of mechanisms is the main inferential workhorse. But the ability to make generalizable claims about mechanisms is also essential for the variance-based approach. For instance, Weller and Barnes (2014, 21) argue that one goal of within-case analysis is "to understand substantive relationships at the level of individual cases and to use those insights to learn something about the population of cases that feature that substantive relationship." Therefore, large-N mediation analysis (see Chap. 6) is often used to study mechanisms. However, by studying many cases using variance-based methods, one learns about the average causal effects of X (or the intervening variable) on the values of Y. An average does not tell us how the linkage works in any given case. In Cartwright's words, average causal effects tell us that "it works somewhere" while leaving us in the dark about how it actually works in any given case (Cartwright, 2011).
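A small simulation can illustrate why an average is silent about case-level linkages. The Python sketch below is purely hypothetical: the mechanism labels CM1 and CM2, the effect sizes, and the population size are invented for illustration, and both potential outcomes are generated for every case, which is only possible in a simulated world.

```python
import random
from statistics import mean

# Hypothetical effect sizes: X raises Y by 1 via CM1 and by 3 via CM2.
EFFECT_BY_MECHANISM = {"CM1": 1.0, "CM2": 3.0}


def simulate_population(n_cases=2000, seed=42):
    """Return (mechanism, y_without_x, y_with_x) for each simulated case."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n_cases):
        # Each case links X to Y through exactly one of the two mechanisms.
        mechanism = rng.choice(sorted(EFFECT_BY_MECHANISM))
        y_without_x = rng.gauss(0.0, 1.0)
        y_with_x = y_without_x + EFFECT_BY_MECHANISM[mechanism]
        cases.append((mechanism, y_without_x, y_with_x))
    return cases


cases = simulate_population()
average_effect = mean(y1 - y0 for _, y0, y1 in cases)
mechanisms_seen = {m for m, _, _ in cases}

print(f"average effect of X on Y: {average_effect:.2f}")
print(f"mechanisms operating in the population: {sorted(mechanisms_seen)}")
```

In this sketch the average effect lies between the two mechanism-specific effects and describes no individual case. The cross-case summary is compatible with one mechanism, with two, or with many, which is the sense in which it "works somewhere" without telling us how.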

Once we find a causal mechanism in a studied case using within-case analysis, the key question asks whether we can infer that a similar – *nota bene*: not exactly the same (!) – mechanism also connects X and Y in other cases. In other words, how do we ensure the external validity of findings about causal mechanisms? The answer heavily depends on the degree of causal heterogeneity at the within-case level.

We speak of mechanistic homogeneity if two or more sufficiently similar mechanisms are operative in all the cases that exhibit the same relationship between X and Y. Mechanistic heterogeneity, on the other hand, refers to two situations: (1) the same X and Y are linked together through different mechanisms (*mechanistic equifinality*), or (2) the same X triggers different mechanisms leading to a different Y (*mechanistic multifinality*) (Beach, 2020; Beach et al., 2019; Beach & Rohlfing, 2018; Falleti & Lynch, 2009; George & Bennett, 2005; Gerring, 2010; Goertz, 2017; Sayer, 2000; Weller & Barnes, 2016).

It is important to note that we do not understand causal mechanisms as chains of events, but instead as process-level causal explanations that provide an account of what actors are doing. This account explains why the actors' activities are linked together and how they contribute to producing the outcome in the case. Of course, these process-level explanations can have varying levels of detail (*aka* abstraction). At the most abstract level are schematic theories that focus on the most critical interactions, describing actors and what they are doing in very abstract terms (e.g., "a political entrepreneur engages in speeches that attempt to frame a debate"). At the other extreme are very detailed, case-specific accounts that use formal nouns to describe actors, include many different parts, and where activities are specified in great detail (Box 10.2).

#### **Box 10.2: Causal Heterogeneity**

The term *causal heterogeneity* includes a range of phenomena linked to complex causal patterns that can characterize any X/Y relationship. In the statistical literature, the problem of causal heterogeneity plays a significant role, for example, when considering whether different subgroups in a given population react differently to a specific treatment, e.g., an administered policy instrument (e.g., Seawright, 2016; Pearl, 2017; Xie et al., 2012). Issues of causal heterogeneity are also prominent in the context of QCA, where they are discussed concerning conjunctural causation, equifinality, and asymmetry (Ragin, 2008; see also Chap. 7). Yet, researchers must be aware that causal heterogeneity not only pertains to X/Y relations but also to the level of mechanisms (e.g., Beach et al., 2019; Beach & Rohlfing, 2018; Goertz, 2017; Weller & Barnes, 2016).

**Fig. 10.1** Abstract examples of mechanistic homogeneity and heterogeneity. Own depiction

Figure 10.1 illustrates the issue of mechanistic homogeneity and heterogeneity using causal diagrams in a stylized form for a simple X/Y relationship.

The first scenario displays one variant of mechanistic homogeneity where X and Y are connected via the same mechanism (CM1) in both cases. In contrast, the next situations all refer to different forms of mechanistic heterogeneity.

In the second scenario, two single but different mechanisms connect the same X to the same Y, CM1 in one case and CM2 in another case.<sup>3</sup>

The situation turns more complex in the third scenario. Here, the same X triggers multiple mechanisms in two cases, i.e., mechanistic multifinality, yet there is only one mechanism that is shared by both cases (CM1), whereas the two cases differ on the second mechanism triggered by X, namely, CM2 versus CM3.

Finally, the fourth scenario shows how different mechanisms might interact with each other in different ways across cases – CM1 and CM2 in one case, and CM1, CM2, and CM3 in the second case.
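The four scenarios can be restated as a small data structure. The Python sketch below is an illustration only: it encodes each case as the set of mechanisms operating in it, using the labels CM1 to CM3 from the text, and the helper names `classify` and `shared` are hypothetical. A set cannot represent how mechanisms interact, so the fourth scenario is captured only in terms of which mechanisms are present.

```python
from typing import Dict, FrozenSet, Tuple

Case = FrozenSet[str]

# Each scenario compares two cases with the same X and the same Y.
SCENARIOS: Dict[int, Tuple[Case, Case]] = {
    1: (frozenset({"CM1"}), frozenset({"CM1"})),
    2: (frozenset({"CM1"}), frozenset({"CM2"})),
    3: (frozenset({"CM1", "CM2"}), frozenset({"CM1", "CM3"})),
    4: (frozenset({"CM1", "CM2"}), frozenset({"CM1", "CM2", "CM3"})),
}


def classify(case_a: Case, case_b: Case) -> str:
    """Homogeneous only if both cases run on the same set of mechanisms."""
    return "homogeneous" if case_a == case_b else "heterogeneous"


def shared(case_a: Case, case_b: Case) -> Case:
    """Mechanisms that operate in both cases."""
    return case_a & case_b


for number, (case_a, case_b) in SCENARIOS.items():
    label = classify(case_a, case_b)
    common = sorted(shared(case_a, case_b))
    print(f"scenario {number}: {label}, shared mechanisms: {common}")
```

Note that nothing in this structure refers to the values of X or Y: they are identical in every scenario, so the classification depends entirely on within-case information.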

<sup>3</sup>Of course, heterogeneity applies when the whole process is different from case to case, but also when parts of it display meaningful diversity.

These illustrations are, of course, very simple scenarios. More frequently, explanatory models do not involve one individual factor, but instead several factors X1, X2, X3…, Xi. Here, patterns can become much more complex. Causal mechanisms can work additively or interact with each other, appear in a different sequential order, or show complementary instead of conflicting effects (among others, see Beach & Rohlfing, 2018, 18–25; Goertz, 2017, 53–57; Mikkelsen, 2017, 429–34; Weller & Barnes, 2016, 433–37 for further illustrations). For instance, X1 and X2 might trigger two mechanisms, CM1 and CM2, but in one case, this happens simultaneously, whereas in other contexts X1 happens before X2, or X1 even triggers CM1, which then leads to X2 triggering CM2 – highlighting temporal or causal ordering as reflections of mechanistic heterogeneity. Another example is discussed under the label of 'masking' (Clarke et al., 2014; Steel, 2008, 68; see also George & Bennett, 2005, 145–47). Masking means that a given X might be linked to the same Y through multiple mechanisms with opposite effects on the Y. For instance, a crisis might trigger a process where some actors engage in a frantic search for solutions and advocate for them. At the same time, the same crisis can push other actors to become risk-averse, thereby starting a process of resistance to any change. In a given case, both processes might be operative, and the outcome is a compromise on some modest change that neither group desired.

#### **10.4 Sources of Mechanistic Heterogeneity in MMR**

When combining cross-case analysis and within-case analysis in MMR to identify causal mechanisms and make generalizable claims about them, a crucial problem is that the information utilized at the cross-case level is usually uninformative about what is going on at the within-case level of mechanisms. Let us revisit the abstract example displayed in Fig. 10.1: there is simply no way to establish how exactly the mechanisms connecting X and Y play out just by looking at the X/Y relations. Against this backdrop, examining how a mechanism works by studying how it works within one case and generalizing to other unstudied cases is extremely risky. Very different mechanistic scenarios might lurk underneath the same X/Y relationship.
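The point can be made concrete with a minimal sketch. It assumes two hypothetical pairs of cases that mirror the first two scenarios of Fig. 10.1; the data structure and the function name `cross_case_view` are illustrative assumptions, not part of any established tool. Once only the X and Y scores are retained, the homogeneous and the heterogeneous scenario become indistinguishable.

```python
# Hypothetical within-case records: each case carries its X and Y scores and
# the mechanism(s) that process tracing would reveal (labels as in Fig. 10.1).
homogeneous = [
    {"X": 1, "Y": 1, "mechanisms": ("CM1",)},
    {"X": 1, "Y": 1, "mechanisms": ("CM1",)},
]
heterogeneous = [
    {"X": 1, "Y": 1, "mechanisms": ("CM1",)},
    {"X": 1, "Y": 1, "mechanisms": ("CM2",)},
]


def cross_case_view(cases):
    """Keep only what a cross-case analysis observes: the X and Y scores."""
    return [(case["X"], case["Y"]) for case in cases]


# The two scenarios look identical at the cross-case level ...
same_cross_case_data = cross_case_view(homogeneous) == cross_case_view(heterogeneous)
# ... although the mechanisms operating within the cases differ.
same_mechanisms = (
    [case["mechanisms"] for case in homogeneous]
    == [case["mechanisms"] for case in heterogeneous]
)

print(same_cross_case_data, same_mechanisms)  # True False
```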

Before we sketch out a generalization strategy sensitive to mechanistic heterogeneity in the next section, we discuss three primary potential sources of mechanistic heterogeneity so that researchers are informed about where to look for heterogeneity pitfalls when generalizing mechanistic claims (Box 10.3).

#### **Box 10.3: Potential Sources of Mechanistic Heterogeneity**

As in cross-case analysis, the assumption of causal homogeneity at the level of mechanisms is usually too heroic to be met in the social sciences. We, therefore, argue that mechanistic heterogeneity should be the default assumption when conducting within-case analysis in general and MMR in particular (Beach et al., 2019). Instead of simply assuming that things work in the same way in different cases, researchers should engage in empirical testing of whether mechanistic heterogeneity is present in a population if they want to avoid making flawed generalizations about the working of causal mechanisms.

A *non-exhaustive* list of *non-exclusive* sources of mechanistic heterogeneity includes, inter alia, complex concepts and measures based on multiple attributes with particular causal properties, qualitative hedges within concepts and measures triggering multiple different mechanisms, omitted causal factors and confounders, varying contexts and differences in scope conditions, factors which are identified as redundant or insignificant at the cross-case level, but still have a causal impact at the level of mechanisms, or different forms of temporal and/or causal dynamics which underlie an X/Y relationship.

#### *10.4.1 Complex Concepts or Measures*

The first source of mechanistic heterogeneity is that concepts and measures used in the cross-case analysis capture more than one causal property and can trigger multiple mechanisms. Concepts in the social sciences are usually thought of as multidimensional constructs that have several analytical levels, i.e., attributes and indicators (Adcock & Collier, 2001; Goertz, 2020). The literature on concepts and concept formation has developed various strategies for systematizing the constitutive properties of a concept so that they can be fruitfully applied in empirical research.

In the so-called *classical approach* to concept formation, the constitutive attributes of a concept are individually necessary and jointly sufficient (Goertz, 2020; Sartori, 1970). The Venn diagram in Fig. 10.2a illustrates the underlying logic, whereby we start from three constitutive attributes (A, B, C). For a case to be captured by a concept using the classical approach, all three properties must be present, i.e., A and B and C. If even one of the three attributes is missing, the given social phenomenon does not qualify as a manifestation of the concept.

On the other hand, the *family resemblance approach* offers an alternative strategy to concept formation. In contrast to the classical approach, concepts only have sufficient attributes without a specific feature being individually necessary. Under family resemblance, a case is described by a concept when it has at least one of the constituent attributes, regardless of which one. The Venn diagram in Fig. 10.2d illustrates this approach: the presence of either A or B or C, or any combination of the three, is sufficient for the concept to be present (Barrenechea & Castillo, 2019; Goertz, 2020).<sup>4</sup>

Beyond these two standard approaches to concept formation, *mixed types* can also be possible.

In one variant, no single attribute is sufficient for the concept to apply; instead, several conceptual properties must be present, none of which is necessary. For example, if we require that two out of three attributes be present for a concept, the concept describes any case showing A and B, or A and C, or B and C, or A and B and C. Figure 10.2c exemplifies this logic based on three ('*n*') conceptual attributes out of which at least two ('*m*') must be given for the concept to apply.

<sup>4</sup> In formal terms, the classical approach to concept formation relies on a logical AND combination, marked by the Boolean '\*'; i.e., A\*B\*C. The family resemblance approach is based on the logical OR combination, marked by the Boolean '+', i.e., A + B + C. See also Chap. 7 on Qualitative Comparative Analysis.

**Fig. 10.2** Concept formation strategies and conceptual heterogeneity. Own depiction based on Barrenechea and Castillo (2019)

Another mixed type of the two standard approaches is based on the idea that one or more constitutive properties of a concept are necessary, while at least one further attribute must be present without any particular one being necessary. For example, thinking again of a concept made up of three attributes A, B, C, we can envisage that A is necessary, but either B or C must be added for a case to be described by the respective concept. As demonstrated in Fig. 10.2b, the concept only applies if another attribute is fulfilled in addition to A.<sup>5</sup>

What does this have to do with mechanistic heterogeneity? The point is that these structures can introduce different levels of (causal) heterogeneity into concepts (Barrenechea & Castillo, 2019; Beach et al., 2019; Collier & Mahon Jr, 1993; Goertz, 2020). As Figure 10.2a highlights, concepts based on individually necessary and jointly sufficient conditions are very homogeneous since cases are described by this concept only if they show all three attributes. On the other end of the spectrum, concepts that follow a family resemblance logic show a high degree of potential heterogeneity because a total of seven attribute combinations lead to the presence of the concept, i.e., all combinations except ~A\*~B\*~C (Fig. 10.2d). The two mixed types can be located in between. Since different attributes have different causal properties and can trigger different causal mechanisms, it does not take much imagination to envisage that this also leads to mechanistic heterogeneity.
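The degree of potential heterogeneity built into each concept structure can be made explicit by counting the attribute combinations that qualify a case. The following is a minimal sketch, assuming three binary attributes as in Fig. 10.2; the rule labels and the helper `qualifying_configurations` are illustrative names rather than established terminology.

```python
from itertools import product

# Membership rules for a concept with three binary attributes (A, B, C).
RULES = {
    "classical (A*B*C)": lambda a, b, c: a and b and c,
    "necessary plus one (A*(B+C))": lambda a, b, c: a and (b or c),
    "m-of-n (at least 2 of 3)": lambda a, b, c: (a + b + c) >= 2,
    "family resemblance (A+B+C)": lambda a, b, c: a or b or c,
}


def qualifying_configurations(rule):
    """Return every attribute combination under which the concept applies."""
    return [cfg for cfg in product([True, False], repeat=3) if rule(*cfg)]


counts = {name: len(qualifying_configurations(rule)) for name, rule in RULES.items()}
for name, n in counts.items():
    print(f"{name}: {n} configuration(s)")
```

The sketch yields one qualifying combination for the classical approach, three and four for the two mixed types, and seven for family resemblance. Each additional combination is a further subset of cases in which a different set of attributes, and potentially a different mechanism, may be at work.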

A study by Binder (2015) on the conditions for robust UN interventions in international conflicts illustrates this. Here, the factor 'spillover effects' is conceptualized via three attributes that capture different spillover aspects. The three aspects are, first, refugee flows; second, transnationally operating rebel groups; and third, other negative effects such as drug traffic, terrorism, and economic downturns. Following a family resemblance approach, any of the three attributes is sufficient for a conflict to count as having spillover effects. In such a situation, the cases included in the cross-case analysis which are coded as experiencing spillover effects contain mechanistic heterogeneity by design: some suffer from only one of these factors, i.e., refugee flows or transnationally operating rebels or economic downturns, others from a combination of two or even all three factors. But the causal mechanisms triggered by each attribute are most probably very different even though they all are coded as cases of 'spillover effect'.

<sup>5</sup>Formally, this can be expressed by A\*(B + C).

In situations like these, we do not know which mechanism is actually present in a given case just by looking at the relationship between X (here, spillover effects) and Y (here, UN intervention). Hence, we cannot generalize from one case to any other since it is unclear whether cases that only show high refugee flows trigger the same mechanism(s) as cases with only transnationally operating rebels or all three attributes present. At best, we might generalize to cases that share the same configuration of conceptual attributes. But even this is difficult, as we highlight below, since there might still be different dynamics at play among cases that share the same attributes.

The problem of (causal) heterogeneity pertains to various concept formation strategies and complex measures. It also occurs if subtypes are constructed and then used in the form of a ranked scale (Collier & Levitsky, 1997; Møller & Skaaning, 2010). It is inherent to index building which rests on the assumption of homogeneity at different levels of the index (Barrenechea & Castillo, 2019). It may also apply to lexical scales where the defining attributes are hierarchically arranged so that the attribute at the lower level is necessary to the next higher level (Skaaning et al., 2015).

All in all, we should expect that causal heterogeneity, and consequently mechanistic heterogeneity, is pervasive when studying public policy phenomena, especially against the backdrop of the widespread use of complex concepts in cross-case analysis. While this might not be a problem if one is only interested in establishing X/Y relations, it becomes a crucial pitfall in MMR if the aim is to generalize the insights gained at the within-case level to a larger sample of unstudied cases. Simply assuming that causal mechanisms play out in similar ways across all cases would not be warranted in this situation.

#### *10.4.2 Known and Unknown Omitted Conditions*

The second source of mechanistic heterogeneity comes from known and/or unknown omitted conditions in cross-case analysis. The problem of *unknown* omitted conditions, i.e., contextual or explanatory factors that are not part of the original model, is frequently discussed in the methodological literature as a problem for MMR (Kuehn & Rohlfing, 2009; Radaelli & Wagemann, 2018; Seawright, 2016; Weller & Barnes, 2016). *Known* omitted conditions, i.e., factors that are not considered in the within-case analysis because they do not make a difference in the cross-case analysis, are less frequently problematized in the literature (but see Álamos-Concha et al., 2021; Beach et al., 2019; Schneider & Rohlfing, 2019).

Conditions omitted in cross-case analysis can substantially impact the within-case level as they can introduce additional mechanisms or interact with existing mechanisms. The problem is straightforward for factors omitted from explanatory models, which are widely discussed in the literature as potential confounders (e.g., Goertz, 2017; Radaelli & Wagemann, 2018; Seawright, 2016; Weller & Barnes, 2014). Yet, contextual (*aka*, scope) conditions that are omitted can also play an important role because they can impact how mechanisms operate (e.g., Bunge, 1997; George & Bennett, 2005; Gerring, 2010; Goertz & Mahoney, 2009; Sayer, 2000). This line of thinking also fits nicely into the context-mechanism-outcome (CMO) framework developed by Pawson and Tilley (1997) concerning realistic evaluations. In a nutshell, the framework posits that mechanisms underlying any cause–effect relationship need to be properly contextualized, and whether they work in similar or different ways across varying contexts remains an empirical issue. Returning to the above example of spillover effects and the strength of UN intervention (Binder, 2015), one question concerning the generalizability from one case to another is whether the mechanisms differ according to the duration of the conflict. For instance, during a protracted conflict, the intensity of violence might ebb and flow, and there might be several waves of refugees where each wave builds up more and more pressure for international action. A different dynamic might be observed during a short but extremely violent conflict. Of course, whether this is meaningful for treating mechanisms as different depends on the theoretical perspective.

While conditions that are not considered in the analysis can play a crucial role in mechanistic heterogeneity and the generalizability of mechanisms across cases, they are not the only source. One problem we might think of when integrating within-case and cross-case analysis to make generalizations about mechanisms is that explanatory factors might turn out as redundant, irrelevant, or insignificant at the cross-case level, but still have an important causal role to play at the within-case level. This is because, strictly speaking, the level at which causes are operative is always within a single case. Therefore, establishing patterns of difference-makers using statistical techniques or QCA tells us nothing about what is going on within cases. Instead, they only allow us to observe patterns of (in)variation across cases.

For instance, a QCA model might show that condition C is irrelevant since the outcome Y appears together with the presence of C (e.g., ABC) and its absence (e.g., AB~C). In short, C is not a difference-maker from a cross-case perspective (Baumgartner & Falk, 2019; see also Chap. 7). However, once we move down to the case level, the presence or absence of C might be causally relevant for the operation of the mechanism as it still constitutes an analytically important context in which the causal mechanism is embedded (Álamos-Concha et al., 2021; Beach et al., 2019; Schneider & Rohlfing, 2019). The same holds for variables that turn out as (in)significant in regression analyses. All that regressions say is that X has, on average, a particular effect on Y, or that it does not; but whether a given factor impacts how the mechanism operates within a given case is an entirely different question that can only be addressed through means of within-case analysis, as this information cannot be derived from the statistical effects (Goertz, 2017; Seawright, 2016; Weller & Barnes, 2014).
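The logic by which C drops out at the cross-case level can be sketched as a single Boolean reduction step. This is a simplified illustration of pairwise minimization, not the full procedure implemented in QCA software; the function name `merge_if_redundant` and the dictionary representation of configurations are assumptions made for the example.

```python
def merge_if_redundant(cfg1, cfg2):
    """Merge two configurations that differ in exactly one condition.

    Configurations map condition names to True (present) or False (absent).
    Returns the reduced configuration, or None if no single condition
    can be dropped.
    """
    if cfg1.keys() != cfg2.keys():
        return None
    differing = [name for name in cfg1 if cfg1[name] != cfg2[name]]
    if len(differing) != 1:
        return None
    return {name: value for name, value in cfg1.items() if name != differing[0]}


# Both configurations are linked to the outcome Y in the cross-case analysis.
abc = {"A": True, "B": True, "C": True}
ab_not_c = {"A": True, "B": True, "C": False}

reduced = merge_if_redundant(abc, ab_not_c)
print(reduced)  # {'A': True, 'B': True}

# The reduced term no longer records C, yet the cases behind it still differ on C.
contexts_behind_reduced_term = sorted({abc["C"], ab_not_c["C"]})
print(contexts_behind_reduced_term)  # [False, True]
```

The reduced term AB summarizes the cross-case pattern, while the contexts behind it remain split between C and ~C. Whether the mechanism works in the same way in both contexts is exactly what within-case analysis would have to establish.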

In sum, issues like context-sensitivity, proper scoping, or omitted factors as a source of causal heterogeneity are widely acknowledged in the literature discussing various forms of cross-case and within-case methods. From the perspective of MMR and the task of generalizing causal mechanisms, the problem is aggravated since researchers need to be aware of the limited homogeneity beneath the effect of X on Y and the possibility of multiple mechanisms connecting X and Y across subsets of cases.

#### *10.4.3 Causal and Temporal Dynamics*

A third problem when generalizing insights about the working of mechanisms in MMR is that an X/Y relation identified at the cross-case level usually tells us (next to) nothing about the underlying causal and/or temporal dynamics. The literature on within-case studies and MMR discusses a variety of different dynamics that can lurk underneath the same X/Y relationship (Beach & Rohlfing, 2018, 18–25; Beach et al., 2019, 125–28; Blatter & Haverland, 2012, 94; Falleti & Mahoney, 2015, 217; Goertz, 2017, 123–69; Grzymala-Busse, 2011, 1275; Mikkelsen, 2017, 429–34; Weller & Barnes, 2016, 434–35). If unnoticed, they can have a tremendous impact on the generalizability of mechanistic claims since the researcher would assume that the same patterns link X and Y in all cases while, in reality, they differ across cases.

One example of mechanistic heterogeneity that can hide behind the same X/Y relation is the temporal sequence of conditions and mechanisms. For instance, a cross-case analysis based on QCA or standard regression techniques might indicate that three factors A, B, C are associated with Y. For illustrational purposes, we use the example of large refugee flows, transnationally operating rebel groups, and other negative effects such as an increase in drug traffic, terrorism, and economic downturns that provoke a robust UN intervention. We can envisage a case where the three factors follow a temporal sequence, according to which the rise of transnational rebel groups (B) first causes an increase in refugee flows (A), which then leads to economic downturns and other negative consequences (C), which finally causes a robust UN humanitarian intervention. Can we now assume that the same sequence is present in all cases? This would probably be a pretty heroic assumption, since many other sequences can still be plausibly theorized. For instance, it might be the case that all three factors appear simultaneously, or the ordering of conditions might be different.
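How many alternative sequences are there? A minimal sketch, assuming that only the temporal order of the three factors matters and that factors may also appear simultaneously, enumerates the possibilities. The function name `weak_orderings` is an illustrative choice, and the enumeration ignores which factor causes which, so it understates the number of distinct mechanistic scenarios.

```python
from itertools import product


def weak_orderings(items):
    """List every way to arrange items into successive stages, ties allowed."""
    dense_rankings = set()
    for ranks in product(range(len(items)), repeat=len(items)):
        levels = sorted(set(ranks))
        # Re-number the ranks so that stages run 0, 1, 2, ... without gaps.
        dense_rankings.add(tuple(levels.index(r) for r in ranks))
    orderings = []
    for ranking in sorted(dense_rankings):
        stages = [
            tuple(item for item, r in zip(items, ranking) if r == stage)
            for stage in range(max(ranking) + 1)
        ]
        orderings.append(stages)
    return orderings


orderings = weak_orderings(("A", "B", "C"))
strict = [o for o in orderings if len(o) == 3]  # no two factors simultaneous
print(len(orderings), len(strict))  # 13 6
```

The sequence described above, B then A then C, is one of six strict orderings and one of thirteen orderings once simultaneity is allowed, which illustrates why assuming a single sequence across all cases is a strong assumption.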

Interaction patterns might be another way that mechanistic heterogeneity manifests itself. For instance, mechanisms might work independently versus conjointly in different cases. Revisiting the example again, the increase in refugee flows, the rise of transnational rebel groups, and negative effects such as an increase in drug traffic, terrorism, and economic downturns might each trigger separate causal mechanisms through different actors and venues that ultimately lead to UN interventions. In other words, A leads to Y, B leads to Y, and C leads to Y through three independent causal mechanisms CM1, CM2, and CM3. However, in other cases, we might find a different situation. One reasonable alternative might be that the three factors do not show an independent effect, but instead work conjointly, so that the causal mechanisms add to or reinforce one another until the UN decides on a robust humanitarian intervention.

It is important to note that these challenges cannot merely be fixed by including interaction terms in regressions or using configurational methods like QCA.<sup>6</sup> Regarding the latter, conjunctions in QCA only tell us that two or more conditions are jointly associated with an outcome; however, they do not tell us anything about the interactions present among the individual conditions within the configuration. The same applies to interaction terms in a regression analysis where we learn that a factor's average causal effect depends on the level of another factor; however, this contains no information on what dynamics and interplays we should expect at the level of mechanisms.

#### **10.5 Taking Mechanistic Heterogeneity in MMR More Seriously**

In all the situations described in the previous section, generalizing from one studied case to other cases that have not been studied risks making flawed inferences about which causal mechanisms are operative in different cases. Strictly speaking, we can only know which mechanisms are operative in a given case by investigating that case. This means that researchers are confronted with an inherent trade-off when establishing the external validity of mechanistic claims: examine all cases within a given population at tremendous analytical costs, or make a mechanistic generalization based on hope, with no empirical evidence to substantiate it (Khosrowi, 2019). The trade-off is of special relevance to public policy, where the complexity of processes in different contexts (both across space and time) makes mechanistic heterogeneity likely pervasive.

To engage with this inherent trade-off, we propose a generalization strategy that pays close attention to mechanistic heterogeneity using a sequential, 'cross-case analysis first/within-case analysis second' design. Building on the work by Weller and Barnes (2014, 2016), we advise engaging in multiple follow-up case studies that assess which causal mechanisms are present in strategically selected cases within a population, thereby gradually establishing the boundaries of the external validity of our mechanistic claims. In situations where we find mechanistic heterogeneity, we should map the different causal mechanisms operating in various subsets of the population to clarify why different mechanisms are operative in different subsets of cases (see Beach et al., 2019, 133–54 for a more detailed discussion) (Box 10.4).

<sup>6</sup>Techniques like mediation analysis, structural equation modeling, or coincidence analysis offer a partial remedy by mapping (causal) chains and sequencing factors. However, other aspects, like whether the speed of events influences the unfolding of complex dynamics between multiple mechanisms, remain open. Additionally, the other sources of mechanistic heterogeneity still play a role.

#### **Box 10.4: Strategy for Testing the Generalizability of Mechanisms Under the Assumption of Mechanistic Heterogeneity**

The rationale behind the suggested snowballing-outwards procedure is to use findings from within-case analysis to revise the knowledge of the boundaries in which particular mechanisms are operative and progressively update the confidence in the external validity of the mechanistic claims which can and which cannot be made. The proposed strategy consists of six steps, starting after the cross-case analysis has produced a robust X/Y relationship:


After a robust X/Y relationship is identified at the cross-case level via statistical or configurational methods, the first step of the proposed generalization strategy starts with theoretically unpacking various potential mechanistic explanations. Unpacking mechanisms involves disaggregating causal processes into parts composed of actors doing things.<sup>7</sup> What is necessary at this stage is that researchers make the causal logic underlying the linkages in a mechanism explicit. Doing so also sheds light on all kinds of factors (causal and contextual) that we might expect to be relevant for whether and/or how a given mechanism works. For instance, one pathway might include a part where, to table a proposal that frames a debate, an expert needs to be trusted as an epistemic authority by the policymakers. In fact, by theorizing and empirically tracing how a mechanism works, we also shed light on the conditions required for it to work in a particular way.

<sup>7</sup>How to define causal mechanisms is debated within the methodological literature (among many, see Beach & Pedersen, 2019; Bennett & Checkel, 2015). Although we cannot go into detail, the problem of generalization and mechanistic heterogeneity is independent of whether one follows a productive account or envisages causal mechanisms rather in terms of intervening factors or very abstract one-liners.

Of course, throughout the next steps, one should still cast the net widely and be open to further evidence about causal mechanisms which have not been hypothesized at this early stage; however, the first step should include a theoretical mapping of the most plausible different mechanistic scenarios and the respective settings in which they might occur.

In the next step, a cross-case mapping of the potential population of cases is undertaken. This involves scoring cases based on values of the explanatory factors X and the outcome Y and potential contextual and causal conditions that might affect how mechanisms work. Here it is crucial to go beyond the identified X/Y relations and to include all analytically relevant (causal or contextual) conditions. In principle, it should be the goal of this mapping to identify clusters of cases as causally homogeneous as possible to minimize the a priori risk of mechanistic heterogeneity.<sup>8</sup>

Based on this mapping, we can select a case for tracing the underlying mechanisms between X and Y. At the initial stage, all positive cases that are members of the X(s), Y, and the given context are potential candidates for process tracing since mechanisms can only be observed in cases where X and Y are present. Ideally, this process tracing identifes one or several mechanisms linking X and Y in a given context C.

However, it might also be the case that no mechanism is identified in the chosen case. Here, we would advise proceeding to another similar case study and checking whether there is also no mechanism linking X and Y. If this is the case, the evidence points towards a mere correlation. Additionally, it could also be that the process tracing reveals one or more contextual factors that impact the working of the mechanism(s), but have not been considered so far. These new contextual features should then be added to revise the mapping of the cases and define more homogeneous subsets.

Based on this initial process tracing of one case, if resources allow it, we should conduct a second study of a case that is as similar as possible to the initially studied case on as many relevant causal and contextual factors as possible. Finding a similar mechanism(s) operative in the second case increases our confidence that the process works similarly across cases. This way, we reduce the risk of missing important factors that might impact how the mechanism works. If, on the other hand, we find a different (or no) mechanism(s) operative in a similar case, we would need to look for omitted conditions that differ between the two cases and which explain the difference in the underlying mechanism.

The exploration of mechanistic heterogeneity then continues by strategically selecting more and more different cases to identify the boundaries within which the mechanism operates. When we find different mechanisms operative, we would then want to assess what conditions differ between the cases to understand under which conditions different mechanisms are operative.
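The selection logic of moving from the most similar to progressively more different cases can be sketched as follows. The case names, scores, and condition labels are invented for illustration, and counting the number of differing binary conditions is only one possible similarity measure, chosen here as a simplifying assumption.

```python
# Hypothetical case scores; names and values are invented for illustration.
cases = {
    "case_1": {"X": 1, "Y": 1, "protracted": 1, "refugees": 1, "rebels": 0},
    "case_2": {"X": 1, "Y": 1, "protracted": 1, "refugees": 1, "rebels": 0},
    "case_3": {"X": 1, "Y": 1, "protracted": 0, "refugees": 1, "rebels": 0},
    "case_4": {"X": 1, "Y": 1, "protracted": 0, "refugees": 0, "rebels": 1},
    "case_5": {"X": 1, "Y": 0, "protracted": 1, "refugees": 1, "rebels": 0},
}
# Causal and contextual conditions used to judge similarity (assumed labels).
CONDITIONS = ("protracted", "refugees", "rebels")


def is_positive(scores):
    """Mechanisms can only be traced where both X and Y are present."""
    return scores["X"] == 1 and scores["Y"] == 1


def dissimilarity(a, b):
    """Count the conditions on which two cases differ."""
    return sum(a[c] != b[c] for c in CONDITIONS)


def selection_order(cases, studied):
    """Order the remaining positive cases from most to least similar."""
    candidates = [
        name for name, scores in cases.items()
        if name != studied and is_positive(scores)
    ]
    return sorted(
        candidates,
        key=lambda name: (dissimilarity(cases[name], cases[studied]), name),
    )


order = selection_order(cases, "case_1")
print(order)  # ['case_2', 'case_3', 'case_4']
```

In this toy population, case_2 would be the natural second study because it matches the first case on all mapped conditions, whereas case_4 probes the outer boundary of the generalization. The negative case_5 is excluded because no mechanism linking X and Y can be traced where Y is absent.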

<sup>8</sup>To make the mapping compatible with mechanistic explanations when working in variance-based designs, qualitative thresholds at which a specific mechanism is expected to trigger need to be established for all explanatory factors, contextual factors, or analytical dimensions.

This exercise of empirically testing for mechanistic heterogeneity should be done with an eye to those sources which seem particularly problematic for the research design. For instance, if one of the main explanatory factors is operationalized via a complex concept, one should check whether different causal attributes impact the unfolding of a mechanism. Similarly, researchers should pay close attention to potential interactions, sequencing, and other dynamics among mechanisms that are hidden behind simple X/Y relationships if there is some theoretical or empirical argument that would lead researchers to expect this. In other words, instead of assuming that the same causal mechanism is present in all cases showing X and Y, we encourage researchers to look beyond the results of the cross-case analysis and leverage additional theoretical and empirical insights and probe whether the mechanistic homogeneity or heterogeneity is present in their MMR design.

#### **10.6 Concluding Remarks**

One reason for the popularity of MMR is that its main objective coincides with the evolving consensus in the social sciences that strong causal explanations require evidence of an association between X and Y and evidence for the underlying causal mechanisms between X and Y. The main objective of this chapter was to familiarize researchers with the notion of mechanistic heterogeneity and the challenges this causes when conducting MMR based on some type of cross-case analysis in combination with some form of within-case method. After discussing some basic logics of MMR, we introduced the idea of mechanistic heterogeneity. We highlighted several sources that can bring about causal heterogeneity at the mechanism level in MMR designs. We contend that mechanistic heterogeneity is typical when conducting social science research. Starting from the assumption that the social world is characterized by causal complexity, which might be present both at the cross-case level and the within-case level, we must pay more attention to mechanistic heterogeneity when making generalizations about mechanisms. Otherwise, we risk ending up with flawed inferences about the working of causal mechanisms across a sample of cases.<sup>9</sup>

Assuming causal homogeneity at the level of mechanisms makes MMR designs considerably easier. But, as tempting as it might sound, we simply do not know a priori whether this assumption is correct in any given MMR design which strives to integrate insights derived through within-case studies and results from a cross-case analysis. To put it more bluntly, "[...] merely assuming that populations are similar at lower levels would amount to an extrapolation based on hope" (Khosrowi, 2019, 48). Against this backdrop, we call upon researchers to do better than assuming mechanistic homogeneity. Instead, we should engage in empirically testing the limits to which we can generalize mechanistic claims, transparently map out the presence of mechanistic heterogeneity, and establish the proper boundaries for the generalization.

<sup>9</sup>Looking beyond the social sciences, causal heterogeneity at the level of mechanisms also plays a crucial role in the life sciences, as the discussions in Steel (2008) and Wilde and Parkkinen (2019) highlight.

The debate about how to achieve this goal is just beginning. We hope that the guidelines and insights presented in this chapter help to improve research practices and encourage more explicit guidelines on how to address mechanistic heterogeneity while deploying different combinations of methods.

#### **Suggested Readings**


#### **Review Questions**


• Make a list of advantages and disadvantages that come with the strategy that maps and tests the boundaries for generalization in multimethod research. Discuss whether the additional efforts justify the proposed gains. Is generalizing mechanisms based on hope a better strategy from your perspective?

#### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

# **Chapter 11 Conclusions. Causality Between Plurality and Unity**

**Alessia Damonte and Fedra Negri**

**Abstract** The previous chapters convey the image of causal analysis in public policy and beyond as a fragmented field where research communities seldom learn from each other's findings. This chapter summarizes the ontological, epistemological, and methodological evidence that causal analysis is characterized by a plurality of objects and "incommensurable" interpretations. It also argues that the same evidence pinpoints how this plurality is complementary at every level, and that causal structures emerge as the elements that link ontology and methodology and can organize heterogeneous findings to improve learning across accounts.

#### **Learning Objectives**

After reading this chapter, you will:


#### **11.1 Introduction**

As Daniel Little pinpointed in Chap. 2 and Leonce Röth and Andrew Bennett elaborated in Chaps. 6 and 8, the social sciences are home to a variety of understandings of "causation" (regularity, counterfactual, manipulability/interventionist, mechanistic) that have molded research with their particular definitions, methodological commitments, techniques of choice and often a claim of priority over alternatives. In Chap. 10, Markus B. Siewert and Derek Beach warned that, notwithstanding the optimistic expectations from the mixed-method quarters, these understandings seldom make research strategies suitable to refine each other's findings, for each sheds its light on the phenomena of interest from a particular height and angle. Therefore, causal analysis looks fragmented into discrete approaches, each yielding its piece of knowledge that seemingly cannot speak to the others.

A. Damonte (\*)

University of Milan, Milan, Italy e-mail: alessia.damonte@unimi.it

F. Negri University of Milan-Bicocca, Milan, Italy e-mail: fedra.negri@unimib.it

This chapter asks whether such fragmentation is unavoidable, undesirable, or both. To find its answer, it proceeds in two steps. Section 11.2 introduces two opposite accounts of how science is made. One maintains that fragmentation is an undesirable state of "confusion of tongues" and science can only advance under a dominant paradigm pursuing the unification of disciplines by reducing research fields "all the way down" to a few fundamental objects. The other considers that the independence of the research fields makes reduction unnecessary and the variety of research interests makes it highly undesirable; nevertheless, some learning can pragmatically happen as for a wanderer that updates her map along the way. Section 11.3 considers whether the state of the art in causal analysis fits the confusion of tongues or the wanderer metaphor along three dimensions: the ontological, the epistemic, and the methodological. Section 11.4 concludes that the field is intrinsically plural in every dimension; however, accounts are complementary, and causal structures can offer common points of reference for organizing findings into dovetailing portrayals of the "causal elephant."

#### **11.2 Two Tales About the Making of Science**

A captivating narrative maintains that science is made in the tension between the two poles of unity and plurality of research mindsets. However, the story turns in different directions depending on one's viewing angle.

#### *11.2.1 The Viewpoint of the History of Science*

The first version builds on the idea that science is a social creation and takes historical forms (Kuhn, 1996; see Wray, 2011; Sankey, 2019). The modern form comprises "disciplines" such as chemistry, biology, or economics. The term denotes the distinct body of knowledge that anyone must master before claiming expertise on a subject matter. Disciplines are usually maintained by departments and faculties within colleges and universities. Their members research the subject matter, contribute to its definition by publishing in specialized outlets, and teach courses to train students in the profession. Hence, a discipline arises from the activities of a community committed to some "matrix" of tenets, theories, and practices.

As Thomas Kuhn argues, disciplinary matrices emerge from the scholarly competition to respond to foundational questions about the ultimate entities of a research field, their interactions and organization, and the techniques suitable to know them. A matrix becomes "normal science," the "paradigm" of reference, or the "received view" when it provides a fruitful definition of some fundamental knowledge problem. Often, such a definition lies in books and articles that become "classics" by virtue of a few crucial features: They offer a successful synthesis of previous efforts, restate the legitimate problems of a field, and leave several questions open for research while establishing the method to tackle them (Kuhn, 1996: 10). As more people are trained to address its questions with the methods of reference, old or alternative approaches are "read out of the profession" (*ivi*: 19). As a result, the winning matrix dwarfs its competitors and dictates the agenda. In the short run, normal science simply neglects those research issues that do "not fit the box" (*ivi*: 24). In the long run, however, the cumulation of intractable "anomalies" puts normal science into crisis and opens a stage of "extraordinary research" (*ivi*: 90). Possibly, the stage results in a "revolution" and the emergence of a new normal.

In short, this theory assumes that ideas in science follow evolutionary dynamics and tend toward a single equilibrium point at a time. This assumption rests less on evidence about disciplinary trajectories than on prescriptive considerations. Indeed, Kuhn (1996:18) shares with Francis Bacon the tenet that "truth emerges more readily from error than from confusion": Science under a single dominant paradigm, albeit limited in its grasp of the world, is preferable to science under competition. As Kuhn argues, competing disciplinary matrices grow "incommensurable" to one another. In turn, incommensurability makes disciplines "immature" and incapable of relevant advancements.

The obstacle, to Kuhn, is mainly semantic. A competing matrix develops scientific terms that are only meaningful within its original vocabulary, as each term is minted to connect some phenomena to particular theories. Thus, theoretical terms become idiosyncratic lexical constructs and create a specific classification of the subject matter that proves irreducible to any other. Out of the shadow of a dominant paradigm, the scientific discourse proceeds in a confusion of tongues, and the debate across communities unfolds as zero-sum confrontations.

#### *11.2.2 The Perspective of the Philosophy of Science*

From the viewpoint of the philosophy of science, the divide runs between "monism" and "pluralism" instead, and the two are understood as research agendas with alternative motivations but of ultimate equal standing.

The monist agenda revolves around the core tenet that "the ultimate aim of a science is to establish a single, complete, and comprehensive account of the natural world (or the part of the world investigated by the science) based on a single set of fundamental principles" (Kellert et al., 2006: *x*). Corollaries of monism are that, at least in principle, such a comprehensive account can describe or explain the world faithfully and strategies of inquiry exist that can produce such a comprehensive account. Scientific monism then turns reducibility into a yardstick to assess the worth of methods and theories: "methods of inquiry are to be accepted based on whether they can yield such an account"; moreover, "individual theories and models in science are to be evaluated in large part based on whether they provide (or come close to providing) a comprehensive and complete account" (*ibidem*).

By contrast, scientific pluralism advocates for an open mind on the nature of causes. It maintains that "there are no definitive arguments for monism and that the multiplicity of approaches that presently characterizes many areas of scientific investigation does not necessarily constitute a deficiency" (Kellert et al., 2006: *x*). In principle, pluralism does not deny the possibility that an encompassing account of the world can be found that effectively allows reducing complexity to the same objects "all the way down." However, it addresses this possibility as an empirical matter decided by evidence that may never prove conclusive.

Besides, the coexistence of various accounts across and within disciplines does not undermine the standing of the knowledge so yielded. Crucially, pluralism commits to maintaining that theories and methods cannot be rejected as "unscientific" on the grounds that they fail to reduce complexity to the same fundamental principle (e.g., Fodor, 1974; Longino, 2013). Pluralism finds the reason for incommensurable approaches in the diversity of the research questions that can be asked. Considerations about the relative autonomy of research fields (e.g., Dupré, 1993), the irrelevance of reducibility to the validity of findings (e.g., Suppes, 1978), and the dappled nature of the world (e.g., Cartwright, 1999) further reinforced the stance. In short, phenomena might be "too complicated or too indeterminate and our cognitive interests too diverse for the monist ideals" (Kellert et al., 2006: *xi*).

Nevertheless, these considerations do not license the conclusion that literally "anything goes." Paul Feyerabend (1993) minted that dictum as the single pluralist principle in a Dadaist mockery of monism, given that, as such, scientific pluralism remains skeptical about the possibility of single fundamental principles in doing science. Instead, the dictum calls for recognizing that any approach has its limits, even when it seems unquestionable. Therefore, science advances when its rules make room for a pragmatic conversation between theories and evidence of any stripe, as a wanderer that updates her map along the way (*ivi*: 223 ff).

#### **11.3 Can We Learn from One Another?**

Both the confusion of tongues and the wanderer metaphors fit the causal landscape of policy studies and social sciences, leaving the question open of whether pragmatic learning can happen across the research communities that inhabit them or strict incommensurability reigns instead. The issue can be addressed along three conventional lines (e.g., Della Porta & Keating, 2008): the ontological, the epistemological, and the methodological.

#### *11.3.1 Ontological Incommensurability?*

Causal ontologies are assumptions about the kinds of ultimate "objects" in a causal account. They are crucial as they indicate where causal analysis legitimately "bottoms out" while avoiding the chasm of infinite regress or circularity. However, the concept has long proven contentious, as it can mean a commitment to dogmas that outweigh evidence instead of some ground for meaningful methodological choices (e.g., Woodward, 2015; see also Damonte & Negri, Chap. 1).

As discussed by Daniel Little and Andrew Bennett in Chaps. 2 and 8, of the four approaches to causality (i.e., regularity, counterfactual, experimental, and mechanistic), the mechanistic stands out as it offers a convenient ultimate ground. Beyond evading infinite regress and circularity, mechanisms can prevent causality from being reduced to non-causal objects such as constant conjunctions or methodological criteria such as counterfactual reasoning. Without some mechanist account of the nature of the process that generates the observed outcome, non-causal objects are analytically unsatisfying and offer only a rough guide to policy choices. As Erich Battistin and Marco Bertoni discussed in Chap. 3, the experimental approach aims at getting as close as possible to causal identification by manipulating the candidate causal factor under controlled conditions. However, the credibility of the findings obtained through manipulation stems from the credibility of the assumptions about the background whence, as Leonce Röth adds in Chap. 6, unknown confounders can operate that bias causal identification. Mechanisms provide testable hypotheses about the relevant covariates in the background, hence make sense of regularity and circumscribe counterfactual reasoning about the outcome to limited regions of the world (e.g., Cartwright et al., 2020; Glennan, 2017; Illari & Williamson, 2012; Machamer et al., 2000; Salmon, 1994).

Scholars from theory-driven areas find mechanistic assumptions easy to embrace (e.g., Peters, 2022; Dowding & Miller, 2019; Busetti & Dente, 2018). The approach is also increasingly accepted within research communities concerned that substantive assumptions may impress biases in conclusions (e.g., Imbens, 2020; Imai et al., 2013). However, the literature contends that the concept can be elusive and its definitions at cross purposes (e.g., Mahoney, 2021; Mayntz, 2020; Seawright, 2018; Goertz, 2017; Gerring, 2011; Pearl, 2000; Holland, 1988; see also Little, Chap. 2, Röth, Chap. 6, Bennett, Chap. 8, and Beach & Siewert, Chap. 10 in this volume).

Against this backdrop, Wesley C. Salmon (1987, 1994; Dowe, 2000; see also George & Bennett, 2005) provides an encompassing definition that also proves sensitive to the many desiderata in causal ontologies. His starting point is Bertrand Russell's grasp of causality as the seamless "persistence of something" across space and time (1948: 459). To preserve the emphasis on the factual side of causation while improving the ability to distinguish it from non-causal phenomena, Salmon borrows from the physical understanding of energy and defines causality as the seamless transmission of some non-null "conserved quantity" across space and time.

As such, causality is singular and inheres in entities as different as still paperweights, thrown baseballs, sent data packets, enacted policy instruments, or engaged strategic actors. Moreover, it exists in the time window between two distinct alterations, regardless of how narrow that window seems to an observer. In turn, alterations occur at *intersections*, the concept that allows discriminating between causal and non-causal transmission processes.

Following Hans Reichenbach (1956), Salmon identifies three possible alterations that a causal quantity can undergo when intersected:


The movement of the conserved quantity across time and places is the "causal rope" connecting two intersections; the other way round, intersections are the starting and the ending point of any specific causal rope. Although the "causal elephant" only arises by virtue of both, it can be addressed either as the causal line of a conserved quantity or as its λ, γ, and χ generation structures.

These complementary viewpoints make the mechanistic ontology intrinsically plural. Indeed, the transferral of "conserved quantities" and linked intersections require different vocabularies to be spoken of. However, each account implies the other—which, in principle, makes room for pragmatic matching and learning. Whether this happens, however, depends on epistemic conditions.

#### *11.3.2 Epistemic Incommensurability?*

The epistemic level comprises the responses to the question of how we know causation. The question implies a further broad distinction between "foundationalists" (e.g., Christensen, 2004; Kaplan, 1994) and "naturalists" (e.g., Kornblith, 1980; Quine, 1969; cfr. Bevir & Kedar, 2008). In the former camp, the main question is how we *should* know causation. The response builds on a vision of scientific epistemology as rules and standards deployed to establish cogent evidentiary arguments. Scholars in the latter camp instead focus on *how it happens* that human beings know causation. They share an interest in knowledge as individual and social belief systems shaped by psychological and interactive sense-making processes.

The plurality of the positions within and across camps is mirrored by the many interpretations of probability deployed over time. Probability turns our conjectures about "something" being such and such instead of anything else into explicit and inspectable conditional relationships (e.g., Hájek, 2007). Such conditionality supports our efforts to predict or retrodict events and make decisions even when our understanding of their determination is limited, our information is partial, or the world appears indeterminate. However, the same conditionality can afford a large number of readings. Gillies (2000: 1; cfr. Weatherford, 1982; Fine, 1973; Kyburg, 1970; Salmon, 1966) identifies four major interpretations:


The logical and the subjective interpretations are often grouped together for their shared focus on human heuristics. In contrast, the frequentist and the propensity readings both assume that probability is independent of the single individual mind which, customarily, qualifies it as "objective." However, the propensity interpretation differs from the pure frequentist: The latter limits itself to "collectives," while propensity makes room for the conditional probability of individual events. As a consequence, frequentists tend to commit to parametric analysis to preserve accuracy in estimates, whereas propensity interpretations usually support non-parametric procedures and, as such, trade accuracy for the flexibility afforded by weaker or no assumptions about the true distribution of the phenomenon of interest.

The expectation camp, too, is easily associated with non-parametric procedures; however, the logical diverges from the subjective interpretation. The former considers information from rational inference structures as a reason for dismissing a relationship between sentences, whereas the latter maintains that the only misleading probability is the inconsistent one. Thus, logical interpretations are concerned with the soundness of the conclusion they license, whereas subjective interpretations allow absurd beliefs about the world as long as the relationship between odds against and in favor meets the formal axioms of probability calculus.
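The coherence requirement of the subjective reading can be made concrete with a minimal sketch. It assumes a textbook betting setup that is not taken from this chapter: an agent prices a unit bet on an event A and on its negation at her stated degrees of belief, and a counterpart can buy or sell both bets at those prices. The function names and the example credences are hypothetical.

```python
import math

def is_coherent(p_a, p_not_a):
    """Beliefs in A and not-A are coherent when both lie in [0, 1] and sum to 1."""
    in_range = 0.0 <= p_a <= 1.0 and 0.0 <= p_not_a <= 1.0
    return in_range and math.isclose(p_a + p_not_a, 1.0, abs_tol=1e-12)

def sure_loss(p_a, p_not_a):
    """Guaranteed loss per unit stake for an agent who trades both bets
    at her own prices: she overpays if the prices sum above 1 and
    undersells if they sum below 1."""
    return abs(p_a + p_not_a - 1.0)

# An absurd but coherent belief: near-certainty in an implausible claim.
absurd_but_coherent = is_coherent(0.99, 0.01)

# An inconsistent belief: the two prices sum to 1.2.
inconsistent = is_coherent(0.7, 0.5)
loss = sure_loss(0.7, 0.5)

print(absurd_but_coherent, inconsistent, round(loss, 2))
```

The sketch only checks consistency between the two degrees of belief. It is silent on whether either belief is plausible, which is the gap that the logical interpretation addresses.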

All in all, these interpretations patently fit the confusion of tongues. Radical subjectivist assumptions annoy those who see them as a license to retain fallacies in reasoning (e.g., Hájek, 2007). Propensity is in the odor of metaphysical speculation, and its causal assumptions imply asymmetries that do not fit the standard axioms of probability (e.g., Humphreys, 1985). Deemed equally deceptive is the claim that mathematical a priori tenets (such as the Law of Large Numbers and the Central Limit Theorem, or the classical Principle of Indifference) confer priority to frequentist probability because they render the ultimate nature of the world (e.g., Freedman, 2010). Logical interpretations appear as deductive as the frequentist and, in addition, are charged with entertaining highly implausible assumptions about human heuristics (e.g., van Fraassen, 1989).

However, once again, each interpretation suits a particular research interest and, pragmatically, they all can be deployed to illuminate the whole of the "causal elephant" from different angles and heights. Yet this does not imply that the methods through which different interpretations are deployed can yield dovetailing knowledge.

#### *11.3.3 Methodological Incommensurability?*

Ascertaining causation has long been a pluralistic matter and has often provided a substitute for ontological assumptions (e.g., Rohlfing & Zuber, 2021; Brady, 2008; see Little, Chap. 2). As recalled by Alessia Damonte and Fedra Negri in Chap. 1 and elaborated by Daniel Little in Chap. 2, the influential Humean ideal establishes that a local causal relationship meets two criteria: First, conditions similar to the observed local ones provide the regular antecedents of the outcomes similar to the observed one (i.e., regularity); second, had our local conditions been absent, then the local outcome should have taken a different magnitude or state than observed (i.e., counterfactual). Otherwise said, the methods to ascertain causation can be reduced to the alternative between "enumeration" and "elimination" (e.g., Hintikka, 1968). Notably, each criterion operates at a distinct level:


In moving from an observation to the claim that the observation is causal, the two criteria have long been accorded different weights. Enumeration can yield lawlike generalizations that capture the robustness of the relationship between kinds across contexts but that, as such, cannot support the claim that the relationship has a causal standing. Barometer readings and storms, hexes and salt dissolving in water, birth control pills and biological male pregnancy: all these relationships can pass enumeration, but not elimination. The storm would have occurred had the barometer been broken, the salt would still have dissolved in water if unhexed, and Mr. Smith would not have gotten pregnant had he ingested aspirins instead. Thus, elimination better supports the intuition that the relationship is effective and that Salmon's "conserved quantity" yielded the outcome. However, Humean local elimination confronts the long-acknowledged "fundamental problem of causal inference": We cannot rerun history to observe the local outcome in the absence or under different local conditions while holding all the other potential confounders constant (e.g., Holland & Rubin, 1987; see also Battistin & Bertoni Chap. 3, Negri Chap. 4, Ornstein Chap. 5).
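The fundamental problem can be illustrated with a small simulation. This is a sketch under stated assumptions, not a rendering of any chapter's analysis: every hypothetical unit carries two potential outcomes, the effect is set to a constant 2.0 for all units, and assignment to the condition is random. Only one outcome per unit is ever revealed, so the unit-level counterfactual stays unobservable, while the comparison of group means still recovers the average effect.

```python
import random

def simulate_units(n, effect=2.0, seed=0):
    """Return (y0, y1) potential outcomes for n hypothetical units.
    The constant effect is an illustrative assumption."""
    rng = random.Random(seed)
    units = []
    for _ in range(n):
        y0 = rng.gauss(10.0, 1.0)        # outcome in the absence of the condition
        units.append((y0, y0 + effect))  # outcome in its presence
    return units

def observe(units, seed=1):
    """Randomly assign each unit to one state and reveal only that outcome."""
    rng = random.Random(seed)
    data = []
    for y0, y1 in units:
        treated = rng.random() < 0.5
        data.append((treated, y1 if treated else y0))
    return data

def difference_in_means(data):
    """Average observed outcome of treated units minus that of untreated units."""
    treated = [y for t, y in data if t]
    control = [y for t, y in data if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

units = simulate_units(10_000)
data = observe(units)
estimate = difference_in_means(data)
print(round(estimate, 2))  # close to the assumed effect of 2.0
```

In real data the `units` list is never available: the analyst only sees `data`, and the credibility of the estimate rests on the assumption that assignment is unrelated to the potential outcomes.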

#### **11.3.3.1 Design-Based Solutions**

The purposeful selection or construction of observation units as "instances" or "cases" enters as a suitable methodological solution to circumvent the fundamental problem of causal inference by making counterfactuals somehow observable. John Stuart Mill (1843) famously systematized the practices and knowledge of the time into two primary designs plus three elaborations. The two basic designs build on the Humean standards as they proceed:


The three further elaborations state that:


Of the five canons, only the last suits continuous-valued phenomena; in all the remaining designs, phenomena are units' binary qualities. Noticeably, the method of concomitant variations also stands out as it cannot establish that the relationship is causal in itself, only that it suggests some causal "fact" (see Negri, Chap. 4).

The other designs are deemed more conclusive as they rely on selected combinations of qualitative diversity in backgrounds, outcomes, and conditions to dismiss the hypothesis that the conditions in the background are relevant to the relationship of interest (agreement) or that the relationship includes causally irrelevant elements (direct difference, indirect difference, and residues). Of the two threats, Mill maintained the latter is more harmful to the standing of the claim that the relationship is causal, which makes difference-based designs more conclusive. Agreement remained the design of reference for studies where the assumptions of the most similar background could prove harder to attain; its double deployment as the indirect method of difference was offered as a strategy to license more credible conclusions.
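The logic of the two basic designs can be sketched on binary data. The example below is a simplified reading of the methods of agreement and of direct difference, with hypothetical cases and condition labels (A to E); it is not a reconstruction of Mill's full canons. Each case is a set of present conditions plus a flag for whether the outcome occurred.

```python
def method_of_agreement(cases):
    """Conditions shared by every case in which the outcome occurs."""
    positives = [conditions for conditions, outcome in cases if outcome]
    if not positives:
        return set()
    return set.intersection(*positives)

def method_of_difference(positive, negative):
    """Conditions present in the positive case and absent from the negative one.
    The result is only informative if the two cases are alike in all else."""
    return positive - negative

# Hypothetical cases: (conditions present, outcome observed)
cases = [
    ({"A", "B", "C"}, True),
    ({"A", "D", "E"}, True),
    ({"B", "C"}, False),
]

agreement = method_of_agreement(cases)
difference = method_of_difference(cases[0][0], cases[2][0])
print(agreement, difference)
```

Both functions single out condition A here, but for different reasons: agreement because A is the only constant across otherwise diverse positive cases, difference because A is the only element that separates two otherwise identical backgrounds. Neither function can check whether the listed conditions exhaust the relevant background, which is the assumption on which the conclusions depend.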

Taken with a grain of salt, the reasoning behind these canons has stood the test of time. While comparative strategies seldom made a secret of their debt toward indirect difference as their design of reference (e.g., Mahoney, 2021; also see Damonte, Chap. 7), it is also hard not to notice how the estimation of the effect in Randomized Controlled Trials shares the rationale of Mill's residues. The same holds for the weaknesses that Mill himself recognized. Design-based inferences can license claims that a relationship is causal but cannot ascertain its direction, absent further assumptions and information. Moreover, "causes" can prove:

	- A *physical* rationale and result from the algebraic sum of its components pointing in different directions, as in the composition of forces. For instance, someone's calculation about compliance may depend on their preferences for noncompliance and information on how likely the penalty is applied (e.g., Klepper & Nagin, 1989). Or it may be that some catch-22 regulations made the original decision to comply impossible to pursue.
	- A *chemical* rationale and result from interactions raising a qualitatively different outcome. For instance, the individual decision to not comply may prove perfectly rational from the individual perspective in the short term, yet turn into a tragedy when the decision spoils a common good and is made under an institutional design that allows opportunism to spill over (Ostrom, 2009).

To prove that the antecedent has some causal import, difference-based designs have to dismiss plurality and composition as background "noise" or part of some "ceteris paribus" clause. However, without knowing how and under which conditions the causal connection holds, the conclusions are possibly inaccurate as their assumptions about the comparability of instances may not hold (e.g., Dunning et al., 2019; Trampusch & Palier, 2016; Morgan & Winship, 2015; Cartwright & Hardie, 2012; Imai et al., 2011; Salmon, 1990; Campbell & Stanley, 1963).

#### **11.3.3.2 Model-Based Solutions**

The increasing attention to causal models responds to the need for testable structural assumptions. It revives the factual side of causal analysis and revolves around a few options, all resonating with Mill's intuition of plural and composite factors but seldom corresponding perfectly.

For instance, Patricia L. Kendall and Paul F. Lazarsfeld (1950; see also Morgan & Winship, 2015) introduce structures to "elaborate" a correlation of interest and so improve its credibility. These structures emerge by stratifying the relationship between X and Y by a multi-value test factor T. Thus, T "interprets" the relationship if it occurs after X but before Y, as in physical composition. Instead, T "explains away" the relationship if it occurs before X and Y, a relationship that Mill would classify as a "fact of causation" without an autonomous shape. The further elaboration "specifies" the relationship by considering the circumstances that affect the partial relationship between X and Y within each stratum of T. Morgan and Winship (2015) note that specification implies an intransitive relationship of T with either X or Y, which may resonate with Mill's chemical composition (with X) or plurality (with Y).
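The "explaining away" elaboration can be sketched with simulated data. The setup is an illustrative assumption: T is reduced to a binary factor for brevity, it precedes both X and Y, and X has no effect on Y once T is fixed. All probabilities and names are hypothetical.

```python
import random

def risk_difference(rows):
    """P(Y=1 | X=1) - P(Y=1 | X=0) in a list of (x, y) pairs."""
    y_x1 = [y for x, y in rows if x == 1]
    y_x0 = [y for x, y in rows if x == 0]
    return sum(y_x1) / len(y_x1) - sum(y_x0) / len(y_x0)

rng = random.Random(42)
records = []
for _ in range(20_000):
    t = int(rng.random() < 0.5)   # test factor, antecedent to X and Y
    p = 0.8 if t else 0.2         # T raises the chance of both X and Y
    x = int(rng.random() < p)
    y = int(rng.random() < p)     # Y depends on T only, never on X
    records.append((t, x, y))

# Association between X and Y in the pooled data
marginal = risk_difference([(x, y) for _, x, y in records])

# Partial associations within each stratum of T
partial = {
    level: risk_difference([(x, y) for t, x, y in records if t == level])
    for level in (0, 1)
}
print(round(marginal, 2), {k: round(v, 2) for k, v in partial.items()})
```

The pooled association is sizeable, whereas the partial associations within each stratum of T are close to zero. Had T been placed after X and before Y instead, the same stratified table would arise, which is why the time order of T is needed to tell "interpretation" from "explaining away."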

Causal structures are also the crux of Pearl (2000; see also Röth, Chap. 6). His approach, too, considers these structures as the solution to the problem of identification. The causal standing of a relationship always builds on three terms (the alleged causal factor X, the outcome factor Y, and the additional term Z) arranged in three fundamental shapes and visualized as directed acyclic graphs: the "chain," the "fork," and the "collider." In the chain, Z is the mediator between X and Y; in the fork, it is the common cause of X and Y; in the collider, it is the effect of Y and, independently, of X. Then, the chain corresponds with Mill's physical composition and the collider with Mill's plurality. In Mill's terms, Pearl's fork again is a "fact of causation." Mill's chemical composition, instead, is discussed as the problem of identifying causal intransitivity in chained structural models (e.g., Halpern, 2016; von Sydow et al., 2016; Hitchcock, 2001).
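The three shapes leave distinct statistical footprints, which a short simulation can make visible. The sketch assumes linear relationships with unit coefficients and standard normal noise, and it uses the partial correlation of X and Y given Z as a stand-in for conditioning on Z; these choices, like the function names, are illustrative and not drawn from Pearl's text.

```python
import math
import random

def corr(a, b):
    """Pearson correlation of two equally long lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting both for z."""
    rxy, rxz, ryz = corr(x, y), corr(x, z), corr(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

def simulate(shape, n=20_000, seed=7):
    """Draw (xs, ys, zs) from a linear version of the named structure."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        e1, e2, e3 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        if shape == "chain":        # X -> Z -> Y
            x = e1
            z = x + e2
            y = z + e3
        elif shape == "fork":       # X <- Z -> Y
            z = e1
            x = z + e2
            y = z + e3
        elif shape == "collider":   # X -> Z <- Y
            x = e1
            y = e2
            z = x + y + e3
        else:
            raise ValueError(f"unknown shape: {shape}")
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return xs, ys, zs

results = {}
for shape in ("chain", "fork", "collider"):
    xs, ys, zs = simulate(shape)
    results[shape] = (corr(xs, ys), partial_corr(xs, ys, zs))
    print(shape, round(results[shape][0], 2), round(results[shape][1], 2))
```

In the chain and the fork, X and Y are correlated and the association vanishes once Z is adjusted for; in the collider, X and Y are uncorrelated and adjusting for Z manufactures an association. The chain and the fork are indistinguishable on these two numbers alone, so telling them apart requires assumptions about ordering that the data do not supply.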

Albeit the confusion of tongues seems to reign again among model-based strategies, here the translation problem does not seem to imply real incommensurability—just blind spots and labeling issues.

#### **11.4 Wrapping Up and Looking Ahead**

This chapter asked whether the different techniques in causal analysis can learn from each other or incommensurability rules instead. The portrayals sketched above suggest that incommensurability hides many complementarities between interests in processes or intersections and between "objective" and "subjective" interpretations of probability. However, interests and interpretations cannot dovetail unless they build on some common ground. Such possible common ground consists of causal structures.

On the one hand, causal structures raise the threats to the identification of the effect of a single factor that designs aim to keep at bay; on the other, they offer the scaffolding for testable models of how and why the effect occurs. Moreover, causal structures connect methodologies with ontological assumptions, albeit far from perfectly so, as summarized in Table 11.1.

Table 11.1 highlights how ontological and methodological viewpoints shed their unique blind spots on structural alternatives. Mill does not consider the common cause as a proper causal structure, for it raises the spurious correlation that enumerative strategies mistake for causal, while Reichenbach and Salmon seemingly disregard structures that could be labeled "disjoint" as they depend on alternative processes, thus suggesting an analytical focus on one "conserved quantity" at a time. In turn, Pearl's graphs do not identify Mill's chemical composition as a distinct shape—possibly treating it as a path in a fork or a version of the chain structure and as a matter of the debate on how to identify actual instances of intransitive causation from sheer dependence. Last, Kendall and Lazarsfeld develop their typology as explorations of facts of causation.

Beyond the differences in standing and usage, these structures promise to offer the terrain where otherwise diverse research strategies can trade their findings, provided that they acknowledge the peculiarities of each other's language. Indeed, ideally, structural assumptions can accommodate results generated with different grammar and syntax rules while addressing the same policy concern. Frequentist probability can yield robust estimates of some effect of interest of Salmon's "conserved quantity" and, hence, support decisions on whether the treatment is worth the policy effort. Propensity probability can assess Salmon's intersection or Reichenbach's reference class to yield more fine-grained estimates of the effect in selected subpopulations. The logical probability can establish whether a reference class makes a sound singular account and afford the *ex-post* evaluation of interventions while improving forecasting. Subjective probability narrows on individual expectations and exposes the heuristics beneath our decisions as policytakers and policymakers, which can only be evaluated in light of knowledge and assumptions about logical reasoning and "objective" evidence.

Strategies and techniques create families that can be accommodated into a single low-dimensional space only at the cost of inviting outraged objections. Nevertheless, we are positive that the efforts of the next generation of eclectic causal analyses to elucidate causal structures can contribute to building more integrated multidimensional maps of crucial policy, political, and social phenomena.


**Table 11.1** Causal structures

Source: own elaboration. References in the main text

#### **References**


Mill, J. S. (1843). *A system of logic, ratiocinative and inductive*. Harper & Brothers.

